SPRINGER TEXTS IN STATISTICS 


A Modern 
Introduction 
to Probability 
and Statistics 


Understanding Why 
and How 


F.M. Dekking 
C. Kraaikamp 
H.P. Lopuhaä 
L.E. Meester 


Springer Texts in Statistics 


Advisors: 
George Casella Stephen Fienberg Ingram Olkin 


F.M. Dekking C. Kraaikamp 
H.P. Lopuhaä L.E. Meester 


A Modern Introduction to 
Probability and Statistics 
Understanding Why and How 


With 120 Figures 


Springer 


Frederik Michel Dekking 

Cornelis Kraaikamp 

Hendrik Paul Lopuhaä 

Ludolf Erwin Meester 

Delft Institute of Applied Mathematics 
Delft University of Technology 
Mekelweg 4 

2628 CD Delft 

The Netherlands 


Whilst we have made considerable efforts to contact all holders of copyright material contained in this 
book, we may have failed to locate some of them. Should holders wish to contact the Publisher, we 
will be happy to come to some arrangement with them. 


British Library Cataloguing in Publication Data 
A modern introduction to probability and statistics. — 
(Springer texts in statistics) 
1. Probabilities 2. Mathematical statistics 
I. Dekking, F. M. 
519.2 
ISBN 978-1-85233-896-1 


Library of Congress Cataloging-in-Publication Data 
A modern introduction to probability and statistics : understanding why and how / F.M. Dekking ... [et 
al.]. 
p. cm. — (Springer texts in statistics) 
Includes bibliographical references and index. 
ISBN 978-1-85233-896-1 
1. Probabilities—Textbooks. 2. Mathematical statistics—Textbooks. I. Dekking, F.M. II. 
Series. 
QA273.M645 2005 
519.2—dc22 2004057700 


Apart from any fair dealing for the purposes of research or private study, or criticism or review, as 
permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, 
stored or transmitted, in any form or by any means, with the prior permission in writing of the publish- 
ers, or in the case of reprographic reproduction in accordance with the terms of licences issued by the 
Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to 
the publishers. 


ISBN 978-1-85233-896-1 


Springer Science+Business Media 
springeronline.com 


© Springer-Verlag London Limited 2005 

The use of registered names, trademarks, etc. in this publication does not imply, even in the absence 
of a specific statement, that such names are exempt from the relevant laws and regulations and therefore 
free for general use. 

The publisher makes no representation, express or implied, with regard to the accuracy of the information 
contained in this book and cannot accept any legal responsibility or liability for any errors or 
omissions that may be made. 


12/3830/543210 Printed on acid-free paper SPIN 10943403 


Preface 


Probability and statistics are fascinating subjects on the interface between 
mathematics and applied sciences that help us understand and solve practical 
problems. We believe that you, by learning how stochastic methods come 
about and why they work, will be able to understand the meaning of statistical 
statements as well as judge the quality of their content, when facing such 
problems on your own. Our philosophy is one of how and why: instead of just 
presenting stochastic methods as cookbook recipes, we prefer to explain the 
principles behind them. 


In this book you will find the basics of probability theory and statistics. In 
addition, there are several topics that go somewhat beyond the basics but 
that ought to be present in an introductory course: simulation, the Poisson 
process, the law of large numbers, and the central limit theorem. Computers 
have brought many changes in statistics. In particular, the bootstrap has 
earned its place. It provides the possibility to derive confidence intervals and 
perform tests of hypotheses where traditional (normal approximation or large 
sample) methods are inappropriate. It is a modern useful tool one should learn 
about, we believe. 


Examples and datasets in this book are mostly from real-life situations, at 
least that is what we looked for in illustrations of the material. Anybody who 
has inspected datasets with the purpose of using them as elementary examples 
knows that this is hard: on the one hand, you do not want to boldly state 
assumptions that are clearly not satisfied; on the other hand, long explanations 
concerning side issues distract from the main points. We hope that we found 
a good middle way. 


A first course in calculus is needed as a prerequisite for this book. In addition 
to high-school algebra, some infinite series are used (exponential, geometric). 
Integration and differentiation are the most important skills, mainly concerning 
one variable (the exceptions, two-dimensional integrals, are encountered in 
Chapters 9-11). Although the mathematics is kept to a minimum, we strove 
to be mathematically correct throughout the book. With respect to probability 
and statistics the book is self-contained. 


The book is aimed at undergraduate engineering students, and students from 
more business-oriented studies (who may gloss over some of the more mathe- 
matically oriented parts). At our own university we also use it for students in 
applied mathematics (where we put a little more emphasis on the math and 
add topics like combinatorics, conditional expectations, and generating func- 
tions). It is designed for a one-semester course: on average two hours in class 
per chapter, the first for a lecture, the second doing exercises. The material 
is also well-suited for self-study, as we know from experience. 


We have divided attention about evenly between probability and statistics. 
The very first chapter is a sampler with differently flavored introductory ex- 
amples, ranging from scientific success stories to a controversial puzzle. Topics 
that follow are elementary probability theory, simulation, joint distributions, 
the law of large numbers, the central limit theorem, statistical modeling (in- 
formal: why and how we can draw inference from data), data analysis, the 
bootstrap, estimation, simple linear regression, confidence intervals, and hy- 
pothesis testing. Instead of a few chapters with a long list of discrete and 
continuous distributions, with an enumeration of the important attributes of 
each, we introduce a few distributions when presenting the concepts and the 
others where they arise (more) naturally. A list of distributions and their 
characteristics is found in Appendix A. 


With the exception of the first one, chapters in this book consist of three main 
parts. First, about four sections discussing new material, interspersed with a 
handful of so-called Quick exercises. Working these—two-or-three-minute— 
exercises should help to master the material and provide a break from reading 
to do something more active. On about two dozen occasions you will find 
indented paragraphs labeled Remark, where we felt the need to discuss more 
mathematical details or background material. These remarks can be skipped 
without loss of continuity; in most cases they require a bit more mathematical 
maturity. Whenever persons are introduced in examples we have determined 
their sex by looking at the chapter number and applying the rule “He is odd, 
she is even.” Solutions to the quick exercises are found in the second to last 
section of each chapter. 


The last section of each chapter is devoted to exercises, on average thirteen 
per chapter. For about half of the exercises, answers are given in Appendix C, 
and for half of these, full solutions in Appendix D. Exercises with both a 
short answer and a full solution are marked with and those with only a 
short answer are marked with FE] (when more appropriate, for example, in 
“Show that ...” exercises, the short answer provides a hint to the key step). 
Typically, the section starts with some easy exercises and the order of the 
material in the chapter is more or less respected. More challenging exercises 
are found at the end. 




Much of the material in this book would benefit from illustration with a 
computer using statistical software. A complete course should also involve 
computer exercises. Topics like simulation, the law of large numbers, the 
central limit theorem, and the bootstrap loudly call for this kind of experi- 
ence. For this purpose, all the datasets discussed in the book are available at 
http://www.springeronline.com/1-85233-896-2. The same Web site also pro- 
vides access, for instructors, to a complete set of solutions to the exercises; 
go to the Springer online catalog or contact textbooks@springer-sbm.com to 
apply for your password. 


Delft, The Netherlands                    F.M. Dekking 
January 2005                              C. Kraaikamp 
                                          H.P. Lopuhaä 
                                          L.E. Meester 


Contents 


1 Why probability and statistics? 
   1.1 Biometry: iris recognition 
   1.2 Killer football 
   1.3 Cars and goats: the Monty Hall dilemma 
   1.4 The space shuttle Challenger 
   1.5 Statistics versus intelligence agencies 
   1.6 The speed of light 

2 Outcomes, events, and probability 
   2.1 Sample spaces 
   2.2 Events 
   2.3 Probability 
   2.4 Products of sample spaces 
   2.5 An infinite sample space 
   2.6 Solutions to the quick exercises 
   2.7 Exercises 

3 Conditional probability and independence 
   3.1 Conditional probability 
   3.2 The multiplication rule 
   3.3 The law of total probability and Bayes’ rule 
   3.4 Independence 
   3.5 Solutions to the quick exercises 
   3.6 Exercises 

4 Discrete random variables 
   4.1 Random variables 
   4.2 The probability distribution of a discrete random variable 
   4.3 The Bernoulli and binomial distributions 
   4.4 The geometric distribution 
   4.5 Solutions to the quick exercises 
   4.6 Exercises 

5 Continuous random variables 
   5.1 Probability density functions 
   5.2 The uniform distribution 
   5.3 The exponential distribution 
   5.4 The Pareto distribution 
   5.5 The normal distribution 
   5.6 Quantiles 
   5.7 Solutions to the quick exercises 
   5.8 Exercises 

6 Simulation 
   6.1 What is simulation? 
   6.2 Generating realizations of random variables 
   6.3 Comparing two jury rules 
   6.4 The single-server queue 
   6.5 Solutions to the quick exercises 
   6.6 Exercises 

7 Expectation and variance 
   7.1 Expected values 
   7.2 Three examples 
   7.3 The change-of-variable formula 
   7.4 Variance 
   7.5 Solutions to the quick exercises 
   7.6 Exercises 

8 Computations with random variables 
   8.1 Transforming discrete random variables 
   8.2 Transforming continuous random variables 
   8.3 Jensen’s inequality 
   8.4 Extremes 
   8.5 Solutions to the quick exercises 
   8.6 Exercises 

9 Joint distributions and independence 
   9.1 Joint distributions of discrete random variables 
   9.2 Joint distributions of continuous random variables 
   9.3 More than two random variables 
   9.4 Independent random variables 
   9.5 Propagation of independence 
   9.6 Solutions to the quick exercises 
   9.7 Exercises 

10 Covariance and correlation 
   10.1 Expectation and joint distributions 
   10.2 Covariance 
   10.3 The correlation coefficient 
   10.4 Solutions to the quick exercises 
   10.5 Exercises 

11 More computations with more random variables 
   11.1 Sums of discrete random variables 
   11.2 Sums of continuous random variables 
   11.3 Product and quotient of two random variables 
   11.4 Solutions to the quick exercises 
   11.5 Exercises 

12 The Poisson process 
   12.1 Random points 
   12.2 Taking a closer look at random arrivals 
   12.3 The one-dimensional Poisson process 
   12.4 Higher-dimensional Poisson processes 
   12.5 Solutions to the quick exercises 
   12.6 Exercises 

13 The law of large numbers 
   13.1 Averages vary less 
   13.2 Chebyshev’s inequality 
   13.3 The law of large numbers 
   13.4 Consequences of the law of large numbers 
   13.5 Solutions to the quick exercises 
   13.6 Exercises 

14 The central limit theorem 
   14.1 Standardizing averages 
   14.2 Applications of the central limit theorem 
   14.3 Solutions to the quick exercises 
   14.4 Exercises 

15 Exploratory data analysis: graphical summaries 
   15.1 Example: the Old Faithful data 
   15.2 Histograms 
   15.3 Kernel density estimates 
   15.4 The empirical distribution function 
   15.5 Scatterplot 
   15.6 Solutions to the quick exercises 
   15.7 Exercises 

16 Exploratory data analysis: numerical summaries 
   16.1 The center of a dataset 
   16.2 The amount of variability of a dataset 
   16.3 Empirical quantiles, quartiles, and the IQR 
   16.4 The box-and-whisker plot 
   16.5 Solutions to the quick exercises 
   16.6 Exercises 

17 Basic statistical models 
   17.1 Random samples and statistical models 
   17.2 Distribution features and sample statistics 
   17.3 Estimating features of the “true” distribution 
   17.4 The linear regression model 
   17.5 Solutions to the quick exercises 
   17.6 Exercises 

18 The bootstrap 
   18.1 The bootstrap principle 
   18.2 The empirical bootstrap 
   18.3 The parametric bootstrap 
   18.4 Solutions to the quick exercises 
   18.5 Exercises 

19 Unbiased estimators 
   19.1 Estimators 
   19.2 Investigating the behavior of an estimator 
   19.3 The sampling distribution and unbiasedness 
   19.4 Unbiased estimators for expectation and variance 
   19.5 Solutions to the quick exercises 
   19.6 Exercises 

20 Efficiency and mean squared error 
   20.1 Estimating the number of German tanks 
   20.2 Variance of an estimator 
   20.3 Mean squared error 
   20.4 Solutions to the quick exercises 
   20.5 Exercises 

21 Maximum likelihood 
   21.1 Why a general principle? 
   21.2 The maximum likelihood principle 
   21.3 Likelihood and loglikelihood 
   21.4 Properties of maximum likelihood estimators 
   21.5 Solutions to the quick exercises 
   21.6 Exercises 

22 The method of least squares 
   22.1 Least squares estimation and regression 
   22.2 Residuals 
   22.3 Relation with maximum likelihood 
   22.4 Solutions to the quick exercises 
   22.5 Exercises 

23 Confidence intervals for the mean 
   23.1 General principle 
   23.2 Normal data 
   23.3 Bootstrap confidence intervals 
   23.4 Large samples 
   23.5 Solutions to the quick exercises 
   23.6 Exercises 

24 More on confidence intervals 
   24.1 The probability of success 
   24.2 Is there a general method? 
   24.3 One-sided confidence intervals 
   24.4 Determining the sample size 
   24.5 Solutions to the quick exercises 
   24.6 Exercises 

25 Testing hypotheses: essentials 
   25.1 Null hypothesis and test statistic 
   25.2 Tail probabilities 
   25.3 Type I and type II errors 
   25.4 Solutions to the quick exercises 
   25.5 Exercises 

26 Testing hypotheses: elaboration 
   26.1 Significance level 
   26.2 Critical region and critical values 
   26.3 Type II error 
   26.4 Relation with confidence intervals 
   26.5 Solutions to the quick exercises 
   26.6 Exercises 

27 The t-test 
   27.1 Monitoring the production of ball bearings 
   27.2 The one-sample t-test 
   27.3 The t-test in a regression setting 
   27.4 Solutions to the quick exercises 
   27.5 Exercises 

28 Comparing two samples 
   28.1 Is dry drilling faster than wet drilling? 
   28.2 Two samples with equal variances 
   28.3 Two samples with unequal variances 
   28.4 Large samples 
   28.5 Solutions to the quick exercises 
   28.6 Exercises 

A Summary of distributions 

B Tables of the normal and t-distributions 

C Answers to selected exercises 

D Full solutions to selected exercises 

References 

List of symbols 

1 Why probability and statistics? 


Is everything on this planet determined by randomness? This question is open 
to philosophical debate. What is certain is that every day thousands and 
thousands of engineers, scientists, business persons, manufacturers, and others 
are using tools from probability and statistics. 


The theory and practice of probability and statistics were developed during 
the last century and are still actively being refined and extended. In this book 
we will introduce the basic notions and ideas, and in this first chapter we 
present a diverse collection of examples where randomness plays a role. 


1.1 Biometry: iris recognition 


Biometry is the art of identifying a person on the basis of his or her personal 
biological characteristics, such as fingerprints or voice. From recent research 
it appears that with the human iris one can beat all existing automatic hu- 
man identification systems. Iris recognition technology is based on the visible 
qualities of the iris. It converts these—via a video camera—into an “iris code” 
consisting of just 2048 bits. This is done in such a way that the code is hardly 
sensitive to the size of the iris or the size of the pupil. However, at different 
times and different places the iris code of the same person will not be exactly 
the same. Thus one has to allow for a certain percentage of mismatching bits 
when identifying a person. In fact, the system allows about 34% mismatches! 
How can this lead to a reliable identification system? The miracle is that dif- 
ferent persons have very different irides. In particular, over a large collection 
of different irides the code bits take the values 0 and 1 about half of the time. 
But that is certainly not sufficient: if one bit determined the other 2047, 
then we could distinguish only two persons. In other words, single bits may 
be random, but the correlation between bits is also crucial (we will discuss 
correlation at length in Chapter 10). John Daugman, who developed the 
iris recognition technology, made comparisons between 222 743 pairs of iris 
codes and concluded that of the 2048 bits 266 may be considered as uncorrelated 
([6]). He then argues that we may consider an iris code as the result 
of 266 coin tosses with a fair coin. This implies that if we compare two such 
codes from different persons, then there is an astronomically small probability 
that these two differ in fewer than 34% of the bits; almost all pairs will differ 
in about 50% of the bits. This is illustrated in Figure 1.1, which originates 
from [6], and was kindly provided by John Daugman. The iris code data con- 
sist of numbers between 0 and 1, each a Hamming distance (the fraction of 
mismatches) between two iris codes. The data have been summarized in two 
histograms, that is, two graphs that show the number of counts of Hamming 
distances falling in a certain interval. We will encounter histograms and other 
summaries of data in Chapter 15. One sees from the figure that for codes from 
the same iris (left side) the mean mismatch fraction is only about 0.09, while for 
different irides (right side) it is about 0.46. 
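
The coin-tossing argument can be checked with a short computation. The following Python sketch is ours and is not part of Daugman's analysis: it assumes, as in the text, that the number of mismatching bits among 266 uncorrelated bits follows a binomial distribution with success probability 1/2, and it adds up the exact probabilities of all mismatch counts whose fraction is below 34%. The function name is an illustrative choice.

```python
from fractions import Fraction
from math import comb


def prob_mismatch_fraction_below(n_bits, threshold):
    """Exact P(X / n_bits < threshold) for X ~ Binomial(n_bits, 1/2)."""
    # Count the bit patterns whose mismatch fraction is strictly below the threshold.
    favourable = sum(
        comb(n_bits, k)
        for k in range(n_bits + 1)
        if Fraction(k, n_bits) < threshold
    )
    # Every one of the 2**n_bits patterns is equally likely for fair coin tosses.
    return Fraction(favourable, 2 ** n_bits)


# Two codes from different persons, modeled as 266 fair coin tosses.
p = prob_mismatch_fraction_below(266, Fraction(34, 100))

if __name__ == "__main__":
    print(float(p))
```

Exact rational arithmetic is used so that the very small tail probability is not lost to rounding; it is converted to a floating-point number only for printing.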


[Figure: “Decision environment for iris recognition.” Two histograms of Hamming 
distance (horizontal axis, 0.0 to 1.0): 546 comparisons of same iris pairs 
(mean = 0.089, stnd dev = 0.042) and 222,743 comparisons of different iris pairs 
(mean = 0.456, stnd dev = 0.018); d’ = 11.36. Theoretical curves: binomial family. 
Theoretical cross-over point: HD = 0.342. Theoretical cross-over rate: 1 in 1.2 million.] 


Fig. 1.1. Comparison of same and different iris pairs. 


Source: J. Daugman. Second IMA Conference on Image Processing: Mathematical 
Methods, Algorithms and Applications, 2000. © Ellis Horwood Publishing 
Limited. 


You may still wonder how it is possible that irides distinguish people so well. 
What about twins, for instance? The surprising thing is that although the 
color of eyes is hereditary, many features of iris patterns seem to be pro- 
duced by so-called epigenetic events. This means that during embryo develop- 
ment the iris structure develops randomly. In particular, the iris patterns of 
(monozygotic) twins are as discrepant as those of two arbitrary individuals. 




For this reason, as early as the 1930s, eye specialists proposed that iris 
patterns might be used for identification purposes. 


1.2 Killer football 


A couple of years ago the prestigious British Medical Journal published a 
paper with the title “Cardiovascular mortality in Dutch men during 1996 
European football championship: longitudinal population study” ([41]). The 
authors claim to have shown that the effect of a single football match is 
detectable in national mortality data. They consider the mortality from in- 
farctions (heart attacks) and strokes, and the “explanation” of the increase is 
a combination of heavy alcohol consumption and stress caused by watching 
the football match on June 22 between the Netherlands and France (lost by 
the Dutch team!). The authors mainly support their claim with a figure like 
Figure 1.2, which shows the number of deaths from the causes mentioned (for 
men over 45), during the period June 17 to June 27, 1996. The middle horizon- 
tal line marks the average number of deaths on these days, and the upper and 
lower horizontal lines mark what the authors call the 95% confidence inter- 
val. The construction of such an interval is usually performed with standard 
statistical techniques, which you will learn in Chapter 23. The interpretation 
of such an interval is rather tricky. That the bar on June 22 sticks out of the 
confidence interval should support the “killer claim.” 


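[Figure: bar chart of the daily number of deaths; vertical axis “Deaths” (ticks at 
30 and 40), horizontal axis with dates labeled June 18, June 22, and June 26.] 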


Fig. 1.2. Number of deaths from infarction or stroke in (part of) June 1996. 


It is rather surprising that such a conclusion is based on a single football 
match, and one could wonder why no probability model is proposed in the 
paper. In fact, as we shall see in Chapter 12, it would not be a bad idea to 
model the time points at which deaths occur as a so-called Poisson process. 




Once we have done this, we can compute how often a pattern like the one in the 
figure might occur, without paying attention to football matches and other 
high-risk national events. To do this we need the mean number of deaths per 
day. This number can be obtained from the data by an estimation procedure 
(the subject of Chapters 19 to 23). We use the sample mean, which is equal to 
$(10 \cdot 27.2 + 41)/11 = 313/11 = 28.45$. (Here we have to make a computation 
like this because we only use the data in the paper: 27.2 is the average over 
the 5 days preceding and following the match, and 41 is the number of deaths 
on the day of the match.) Now let $p_{\text{high}}$ be the probability that there are 
41 or more deaths on a day, and let $p_{\text{usual}}$ be the probability that there are 
between 21 and 34 deaths on a day; here 21 and 34 are the lowest and the 
highest number that fall in the interval in Figure 1.2. From the formula of the 
Poisson distribution given in Chapter 12 one can compute that $p_{\text{high}} = 0.008$ 
and $p_{\text{usual}} = 0.820$. Since events on different days are independent according 
to the Poisson process model, the probability $p$ of a pattern as in the figure is 


$$p = p_{\text{usual}}^5 \cdot p_{\text{high}} \cdot p_{\text{usual}}^5 = 0.0011.$$ 


From this it can be shown by (a generalization of) the law of large numbers 
(which we will study in Chapter 13) that such a pattern would appear about 
once every 1/p ≈ 899 days (with p taken before rounding to 0.0011). So it is 
not overwhelmingly exceptional to find such a pattern, and the fact that there 
was an important football match on the day in the middle of the pattern might 
just have been a coincidence. 
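
The steps of this calculation can be sketched in Python. This is a minimal illustration under the assumptions of the text (independent days, Poisson counts); the function names are ours. The exact values returned for the two Poisson probabilities depend on the mean that is supplied, so the rounded figures 0.820 and 0.008 quoted above are used only as inputs to the final product and are not asserted as outputs.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed on the log scale for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def prob_between(low, high, lam):
    """P(low <= X <= high), both endpoints included."""
    return sum(poisson_pmf(k, lam) for k in range(low, high + 1))


def prob_at_least(k, lam):
    """P(X >= k), via the complement of P(X <= k - 1)."""
    return 1.0 - prob_between(0, k - 1, lam)


def pattern_probability(p_usual, p_high, days_each_side=5):
    """Usual days, then one high day, then usual days, all independent."""
    return p_usual ** days_each_side * p_high * p_usual ** days_each_side


# Sample mean from the figures quoted in the text.
sample_mean = (10 * 27.2 + 41) / 11

# Last step of the argument, using the rounded probabilities quoted in the text.
p_pattern = pattern_probability(0.820, 0.008)

if __name__ == "__main__":
    print(f"sample mean          : {sample_mean:.2f}")
    print(f"P(41 or more deaths) : {prob_at_least(41, sample_mean):.4f}")
    print(f"P(21 to 34 deaths)   : {prob_between(21, 34, sample_mean):.4f}")
    print(f"pattern probability  : {p_pattern:.4f}")
```

Passing a different value for the mean to `prob_at_least` and `prob_between` shows how sensitive the two probabilities, and hence the final answer, are to the estimate that is plugged in.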


1.3 Cars and goats: the Monty Hall dilemma 


On Sunday September 9, 1990, the following question appeared in the “Ask 
Marilyn” column in Parade, a Sunday supplement to many newspapers across 
the United States: 


Suppose you’re on a game show, and you're given the choice of three 
doors; behind one door is a car; behind the others, goats. You pick a 
door, say No. 1, and the host, who knows what’s behind the doors, 
opens another door, say No. 3, which has a goat. He then says to you, 
“Do you want to pick door No. 2?” Is it to your advantage to switch 
your choice?—Craig F. Whitaker, Columbia, Md. 


Marilyn’s answer (one should switch) caused an avalanche of reactions, in total 
an estimated 10 000. Some of these reactions were not so flattering (“You 
are the goat”), and quite a lot were by professional mathematicians (“You blew 
it, and blew it big,” “You are utterly incorrect .... How many irate mathematicians 
are needed to change your mind?”). Perhaps some of the reactions 
were so strong because Marilyn vos Savant, the author of the column, is in 
the Guinness Book of Records for having one of the highest IQs in the world. 




The switching question was inspired by Monty Hall’s “Let’s Make a Deal” 
game show, which ran with small interruptions for 23 years on various U.S. 
television networks. 


Although it is not explicitly stated in the question, the game show host will 
always open a door with a goat after you make your initial choice. Many 
people would argue that in this situation it does not matter whether one 
would change or not: one door has a car behind it, the other a goat, so the 
odds to get the car are fifty-fifty. To see why they are wrong, consider the 
following argument. In the original situation two of the three doors have a 
goat behind them, so with probability 2/3 your initial choice was wrong, and 
with probability 1/3 it was right. Now the host opens a door with a goat (note 
that he can always do this). In case your initial choice was wrong the host has 
only one option to show a door with a goat, and switching leads you to the 
door with the car. In case your initial choice was right the host has two goats 
to choose from, so switching will lead you to a goat. We see that switching 
is the best strategy, doubling our chances to win. To stress this argument, 
consider the following generalization of the problem: suppose there are 10 000 
doors, behind one is a car and behind the rest, goats. After you make your 
choice, the host will open 9998 doors with goats, and offers you the option to 
switch. To change or not to change, that’s the question! Still not convinced? 
Use your Internet browser to find one of the zillion sites where one can run a 
simulation of the Monty Hall problem (more about simulation in Chapter 6). 
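
For readers who would rather run such a simulation themselves, here is a minimal sketch. It assumes the rules described above: the host always opens a goat door other than the contestant's pick, choosing at random when two goat doors are available. The function names and the number of rounds are illustrative choices, not part of the original problem.

```python
import random

def play(switch, rng):
    """Play one round of the three-door game; return True if the car is won."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a goat door that is not the contestant's pick.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the single door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, rounds=100_000, seed=1):
    """Fraction of simulated rounds in which the contestant wins the car."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(rounds)) / rounds

print("stay:  ", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```

The two printed frequencies should settle near 1/3 and 2/3, in line with the argument above.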


In fact, there are quite a lot of variations on the problem. For example, the 
situation that there are four doors: you select a door, the host always opens a 
door with a goat, and offers you to select another door. After you have made 
up your mind he opens a door with a goat, and again offers you to switch. 
After you have decided, he opens the door you selected. What is now the best 
strategy? In this situation switching only at the last possible moment yields 
a probability of 3/4 to bring the car home. Using the law of total probability 
from Section 3.3 you will find that this is indeed the best possible strategy. 
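
The four-door variant can be explored in the same way. The sketch below is one reading of the rules, with two assumptions of our own: whenever the host has several goat doors to choose from he picks one at random, and a contestant who switches at the first opportunity picks one of the other closed doors at random.

```python
import random

DOORS = (0, 1, 2, 3)

def play(switch_first, switch_last, rng):
    """Play one four-door round; return True if the final pick hides the car."""
    car = rng.choice(DOORS)
    pick = rng.choice(DOORS)
    opened = []
    for switch in (switch_first, switch_last):
        # The host opens a closed goat door other than the current pick.
        opened.append(rng.choice(
            [d for d in DOORS if d not in opened and d not in (pick, car)]))
        if switch:
            # Move to one of the other closed doors, chosen at random.
            pick = rng.choice([d for d in DOORS if d not in opened and d != pick])
    return pick == car

def win_rate(switch_first, switch_last, rounds=100_000, seed=1):
    """Fraction of simulated rounds won under the given strategy."""
    rng = random.Random(seed)
    return sum(play(switch_first, switch_last, rng) for _ in range(rounds)) / rounds

for first in (False, True):
    for last in (False, True):
        print(f"switch first={first!s:5} last={last!s:5} -> {win_rate(first, last):.3f}")
```

Under these assumptions the strategy "stay first, switch last" should come out on top, with a frequency close to 3/4.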


1.4 The space shuttle Challenger 


On January 28, 1986, the space shuttle Challenger exploded about one minute 
after it had taken off from the launch pad at Kennedy Space Center in Florida. 
The seven astronauts on board were killed and the spacecraft was destroyed. 
The cause of the disaster was explosion of the main fuel tank, caused by flames 
of hot gas erupting from one of the so-called solid rocket boosters. 


These solid rocket boosters had been cause for concern since the early years 
of the shuttle. They are manufactured in segments, which are joined at a later 
stage, resulting in a number of joints that are sealed to protect against leakage. 
This is done with so-called O-rings, which in turn are protected by a layer 
of putty. When the rocket motor ignites, high pressure and high temperature
build up within. In time these may burn away the putty and subsequently
erode the O-rings, eventually causing hot flames to erupt on the outside. In a 
nutshell, this is what actually happened to the Challenger. 


After the explosion, an investigative commission determined the causes of the 
disaster, and a report was issued with many findings and recommendations 
([24]). On the evening of January 27, a decision to launch the next day had 
been made, notwithstanding the fact that an extremely low temperature of 
31°F had been predicted, well below the operating limit of 40°F set by Morton 
Thiokol, the manufacturer of the solid rocket boosters. Apparently, a “man- 
agement decision” was made to overrule the engineers’ recommendation not 
to launch. The inquiry faulted both NASA and Morton Thiokol management 
for giving in to the pressure to launch, ignoring warnings about problems with 
the seals. 


The Challenger launch was the 24th of the space shuttle program, and we 
shall look at the data on the number of failed O-rings, available from previous 
launches (see [5] for more details). Each rocket has three O-rings, and two 
rocket boosters are used per launch, so in total six O-rings are used each 
time. Because low temperatures are known to adversely affect the O-rings, 
we also look at the corresponding launch temperature. In Figure 1.3 the dots 
show the number of failed O-rings per mission (there are 23 dots—one time the 
boosters could not be recovered from the ocean; temperatures are rounded to 
the nearest degree Fahrenheit; in case of two or more equal data points these 
are shifted slightly.). If you ignore the dots representing zero failures, which 
all occurred at high temperatures, a temperature effect is not apparent. 


[Figure 1.3: number of failed O-rings per mission (vertical axis, 0 to 6) plotted against launch temperature in °F (horizontal axis, 30 to 90), together with the fitted curve.]


Source: based on data from Volume VI of the Report of the Presidential 
Commission on the space shuttle Challenger accident, Washington, DC, 1986. 


Fig. 1.3. Space shuttle failure data of pre-Challenger missions and fitted model of 
expected number of failures per mission function. 




In a model to describe these data, the probability p(t) that an individual 
O-ring fails should depend on the launch temperature t. Per mission, the 
number of failed O-rings follows a so-called binomial distribution: six O-rings, 
and each may fail with probability p(t); more about this distribution and the 
circumstances under which it arises can be found in Chapter 4. A logistic 
model was used in [5] to describe the dependence on t: 

\[ p(t) = \frac{e^{a + b \cdot t}}{1 + e^{a + b \cdot t}}. \]


A high value of $a + b \cdot t$ corresponds to a high value of p(t), a low value to
low p(t). Values of a and b were determined from the data, according to the
following principle: choose a and b so that the probability that we get data as
in Figure 1.3 is as high as possible. This is an example of the use of the method
of maximum likelihood, which we shall discuss in Chapter 21. This results in
a = 5.085 and b = −0.1156, which indeed leads to lower probabilities at higher
temperatures, and to p(31) = 0.8178. We can also compute the (estimated)
expected number of failures, $6 \cdot p(t)$, as a function of the launch temperature t;
this is the plotted line in the figure.
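
The fitted model is easy to evaluate numerically. The following sketch uses the estimates a = 5.085 and b = −0.1156 quoted above; the function names and the temperatures other than 31°F are illustrative choices.

```python
import math

A, B = 5.085, -0.1156  # estimates of a and b quoted in the text

def p_fail(t):
    """Estimated probability that a single O-ring fails at launch temperature t (°F)."""
    z = A + B * t
    return math.exp(z) / (1 + math.exp(z))

def expected_failures(t, n_rings=6):
    """Estimated expected number of failed O-rings per mission, 6 * p(t)."""
    return n_rings * p_fail(t)

for t in (31, 50, 70, 90):
    print(t, round(p_fail(t), 4), round(expected_failures(t), 2))
```

At t = 31 the first printed probability reproduces the value 0.8178 given in the text.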

Combining the estimates with estimated probabilities of other events that 
should happen for a complete failure of the field-joint, the estimated proba- 
bility of such a failure is 0.023. With six field-joints, the probability of at least 
one complete failure is then $1 - (1 - 0.023)^6 = 0.13$!
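
The last step is a "one minus the probability of no failures" computation. As a small check, the sketch below reproduces it; like the formula itself, it treats the six field-joints as failing independently of each other, and the variable names are ours.

```python
p_joint = 0.023   # estimated probability that a single field-joint fails completely
n_joints = 6

# Probability that no joint fails, and its complement.
p_none = (1 - p_joint) ** n_joints
p_at_least_one = 1 - p_none

print(round(p_at_least_one, 2))
```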


1.5 Statistics versus intelligence agencies 


During World War II, information about Germany’s war potential was essen- 
tial to the Allied forces in order to schedule the time of invasions and to carry 
out the allied strategic bombing program. Methods for estimating German 
production used during the early phases of the war proved to be inadequate. 
In order to obtain more reliable estimates of German war production, ex- 
perts from the Economic Warfare Division of the American Embassy and the 
British Ministry of Economic Warfare started to analyze markings and serial 
numbers obtained from captured German equipment. 

Each piece of enemy equipment was labeled with markings, which included 
all or some portion of the following information: (a) the name and location 
of the maker; (b) the date of manufacture; (c) a serial number; and (d)
miscellaneous markings such as trademarks, mold numbers, casting numbers, 
etc. The purpose of these markings was to maintain an effective check on 
production standards and to perform spare parts control. However, these same 
markings offered Allied intelligence a wealth of information about German 
industry. 

The first products to be analyzed were tires taken from German aircraft shot down
over Britain and from supply dumps of aircraft and motor vehicle tires captured
in North Africa. The marking on each tire contained the maker’s name,
a serial number, and a two-letter code for the date of manufacture. The first
step in analyzing the tire markings involved breaking the two-letter date code. 
It was conjectured that one letter represented the month and the other the 
year of manufacture, and that there should be 12 letter variations for the 
month code and 3 to 6 for the year code. This, indeed, turned out to be true. 
The following table presents examples of the 12 letter variations used by four 
different manufacturers. 


            Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Dunlop       T   I   E   B   R   A   P   O   L   N   U   D
Fulda        F   U   L   D   A   M   U   N   S   T   E   R
Phoenix      F   O   N   I   X   H   A   M   B   U   R   G
Sempirit     A   B   C   D   E   F   G   H   I   J   K   L


Reprinted with permission from “An empirical approach to economic intelligence”
by R. Ruggles and H. Brodie, pp. 72-91, Vol. 42, No. 237. © 1947 by
the American Statistical Association. All rights reserved.

For instance, the Dunlop code was Dunlop Arbeit spelled backwards. Next, 
the year code was broken and the numbering system was solved so that for 
each manufacturer individually the serial numbers could be dated. Moreover, 
for each month, the serial numbers could be recoded to numbers running 
from 1 to some unknown largest number N, and the observed (recoded) serial 
numbers could be seen as a subset of this. The objective was to estimate N 
for each month and each manufacturer separately by means of the observed 
(recoded) serial numbers. In Chapter 20 we discuss two different methods 
of estimation, and we show that the method based on only the maximum 
observed (recoded) serial number is much better than the method based on 
the average observed (recoded) serial numbers. 
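
To get a feeling for the difference between the two approaches, one can simulate the situation. The sketch below makes several assumptions that are not stated in this section: the observed serial numbers are a random sample without replacement from 1, ..., N; the average-based estimate is taken to be 2·(sample mean) − 1 and the maximum-based estimate ((n + 1)/n)·(sample maximum) − 1, both common choices; and N = 1000, n = 10 are hypothetical values.

```python
import random
import statistics

def estimate_from_mean(sample):
    """Estimate N from the sample average (assumed form: 2 * mean - 1)."""
    return 2 * statistics.mean(sample) - 1

def estimate_from_max(sample):
    """Estimate N from the sample maximum (assumed form: (n + 1) / n * max - 1)."""
    n = len(sample)
    return (n + 1) / n * max(sample) - 1

def compare(N=1000, n=10, repeats=5000, seed=1):
    """Return the spread (standard deviation) of both estimates over many samples."""
    rng = random.Random(seed)
    from_mean, from_max = [], []
    for _ in range(repeats):
        sample = rng.sample(range(1, N + 1), n)
        from_mean.append(estimate_from_mean(sample))
        from_max.append(estimate_from_max(sample))
    return statistics.pstdev(from_mean), statistics.pstdev(from_max)

sd_mean, sd_max = compare()
print("spread of mean-based estimate:", round(sd_mean, 1))
print("spread of max-based estimate: ", round(sd_max, 1))
```

In such a run the maximum-based estimates scatter noticeably less around the true N than the average-based ones.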


With a sample of about 1400 tires from five producers, individual monthly 
output figures were obtained for almost all months over a period from 1939 
to mid-1943. The following table compares the accuracy of estimates of the 
average monthly production of all manufacturers of the first quarter of 1943 
with the statistics of the Speer Ministry that became available after the war. 
The accuracy of the estimates can be appreciated even more if we compare 
them with the figures obtained by Allied intelligence agencies. Using other
methods, they estimated the production to be between 900 000 and 1 200 000 per month!


Type of tire                Estimated production    Actual production
Truck and passenger car           147 000                159 000
Aircraft                           28 500                 26 400
Total                             175 500                186 100


Reprinted with permission from “An empirical approach to economic intelligence”
by R. Ruggles and H. Brodie, pp. 72-91, Vol. 42, No. 237. © 1947 by
the American Statistical Association. All rights reserved.




1.6 The speed of light 


In 1983 the definition of the meter (the SI unit of length) was changed to:
The meter is the length of the path traveled by light in vacuum during a time 
interval of 1/299 792 458 of a second. This implicitly defines the speed of light 
as 299 792 458 meters per second. It was done because one thought that the 
speed of light was so accurately known that it made more sense to define the 
meter in terms of the speed of light rather than vice versa, a remarkable end 
to a long story of scientific discovery. For a long time most scientists believed 
that the speed of light was infinite. Early experiments devised to demonstrate 
the finiteness of the speed of light failed because the speed is so extraordi- 
narily high. In the 18th century this debate was settled, and work started on 
determination of the speed, using astronomical observations, but a century 
later scientists turned to earth-based experiments. Albert Michelson refined 
experimental arrangements from two previous experiments and conducted a 
series of measurements in June and early July of 1879, at the U.S. Naval 
Academy in Annapolis. In this section we give a very short summary of his 
work. It is extracted from an article in Statistical Science ([18]). 


The principle of speed measurement is easy, of course: measure a distance and 
the time it takes to travel that distance, the speed equals distance divided by 
time. For an accurate determination, both the distance and the time need 
to be measured accurately, and with the speed of light this is a problem: 
either we should use a very large distance and the accuracy of the distance 
measurement is a problem, or we have a very short time interval, which is also 
very difficult to measure accurately. 

In Michelson’s time it was known that the speed of light was about 300000 
km/s, and he embarked on his study with the goal of an improved value of the 
speed of light. His experimental setup is depicted schematically in Figure 1.4. 
Light emitted from a light source is aimed, through a slit in a fixed plate, 
at a rotating mirror; we call its distance from the plate the radius. At one 
particular angle, this rotating mirror reflects the beam in the direction of a 
distant (fixed) flat mirror. On its way the light first passes through a focusing 
lens. This second mirror is positioned in such a way that it reflects the beam 
back in the direction of the rotating mirror. In the time it takes the light to 
travel back and forth between the two mirrors, the rotating mirror has moved 
by an angle $\alpha$, resulting in a reflection on the plate that is displaced with
respect to the source beam that passed through the slit. The radius and the
displacement determine the angle $\alpha$ because

\[ \tan 2\alpha = \frac{\text{displacement}}{\text{radius}}, \]

and combined with the number of revolutions per second (rps) of the mirror,
this determines the elapsed time:

\[ \text{time} = \frac{\alpha / 2\pi}{\text{rps}}. \]


[Figure 1.4: schematic of the setup showing the light source, the rotating mirror, the focusing lens, and the fixed mirror, with the distance, the radius, and the displacement marked.]


Fig. 1.4. Michelson’s experiment. 


During this time the light traveled twice the distance between the mirrors, so 
the speed of light in air now follows: 


\[ c_{\text{air}} = \frac{2 \cdot \text{distance}}{\text{time}}. \]

All in all, it looks simple: just measure the four quantities—distance, radius, 
displacement and the revolutions per second—and do the calculations. This 
is much harder than it looks, and problems in the form of inaccuracies are 
lurking everywhere. An error in any of these quantities translates directly into 
some error in the final result. 
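
The chain of calculations can be written down in a few lines. The sketch below follows the three formulas above; the function name is ours, and the input values are hypothetical numbers constructed to be consistent with a speed of 3.0 × 10^8 m/s, not Michelson's recorded measurements.

```python
import math

def speed_from_measurements(distance, radius, displacement, rps):
    """Speed computed from the four measured quantities (consistent units assumed)."""
    alpha = 0.5 * math.atan(displacement / radius)  # from tan(2*alpha) = displacement / radius
    time = (alpha / (2 * math.pi)) / rps            # fraction of a revolution divided by rps
    return 2 * distance / time                      # light travels the distance twice

# Hypothetical setup: construct a displacement that matches a known speed,
# then check that the formulas recover that speed.
c_true = 3.0e8                                      # meters per second, assumed
distance, radius, rps = 600.0, 9.0, 250.0           # meters, meters, revolutions per second
time_true = 2 * distance / c_true
alpha_true = 2 * math.pi * rps * time_true
displacement = radius * math.tan(2 * alpha_true)

print(speed_from_measurements(distance, radius, displacement, rps))
```

Perturbing any one of the four inputs slightly and rerunning shows how a measurement error carries over into the computed speed.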


Michelson did the utmost to reduce errors. For example, the distance between 
the mirrors was about 2000 feet, and to measure it he used a steel measuring 
tape. Its nominal length was 100 feet, but he carefully checked this using a 
copy of the official “standard yard.” He found that the tape was in fact 100.006 
feet. This way he eliminated a (small) systematic error. 


Now imagine using the tape to measure a distance of 2000 feet: you have to use 
the tape 20 times, each time marking the next 100 feet. Do it again, and you 
probably find a slightly different answer, no matter how hard you try to be 
very precise in every step of the measuring procedure. This kind of variation 
is inevitable: sometimes we end up with a value that is a bit too high, other 
times it is too low, but on average we’re doing okay—assuming that we have 
eliminated sources of systematic error, as in the measuring tape. Michelson 
measured the distance five times, which resulted in values between 1984.93 
and 1985.17 feet (after correcting for the temperature-dependent stretch), and 
he used the average as the “true distance.” 


In many phases of the measuring process Michelson attempted to identify 
and determine systematic errors and subsequently applied corrections. He
also systematically repeated measuring steps and averaged the results to reduce
variability. His final dataset consists of 100 separate measurements (see
Table 17.1), but each is in fact summarized and averaged from repeated
measurements on several variables. The final result he reported was that the speed
of light in vacuum (this involved a conversion) was 299 944 ± 51 km/s, where
the 51 is an indication of the uncertainty in the answer. In retrospect, we must
conclude that, in spite of Michelson’s admirable meticulousness, some source
of error must have slipped his attention, as his result is off by about 150 km/s.
With current methods we would derive from his data a so-called 95% confidence
interval: 299 944 ± 15.5 km/s, suggesting that Michelson’s uncertainty
analysis was a little conservative. The methods used to construct confidence
intervals are the topic of Chapters 23 and 24.


2 


Outcomes, events, and probability 


The world around us is full of phenomena we perceive as random or unpre- 
dictable. We aim to model these phenomena as outcomes of some experiment, 
where you should think of experiment in a very general sense. The outcomes 
are elements of a sample space $\Omega$, and subsets of $\Omega$ are called events. The events
will be assigned a probability, a number between 0 and 1 that expresses how 
likely the event is to occur. 


2.1 Sample spaces 


Sample spaces are simply sets whose elements describe the outcomes of the 
experiment in which we are interested. 


We start with the most basic experiment: the tossing of a coin. Assuming that 
we will never see the coin land on its rim, there are two possible outcomes: 
heads and tails. We therefore take as the sample space associated with this 
experiment the set $\Omega = \{H, T\}$.

In another experiment we ask the next person we meet on the street in which 
month her birthday falls. An obvious choice for the sample space is 


\[ \Omega = \{\text{Jan}, \text{Feb}, \text{Mar}, \text{Apr}, \text{May}, \text{Jun}, \text{Jul}, \text{Aug}, \text{Sep}, \text{Oct}, \text{Nov}, \text{Dec}\}. \]


In a third experiment we load a scale model for a bridge up to the point 
where the structure collapses. The outcome is the load at which this occurs. 
In reality, one can only measure with finite accuracy, e.g., to five decimals, and 
a sample space with just those numbers would strictly be adequate. However, 
in principle, the load itself could be any positive number and therefore
$\Omega = (0, \infty)$ is the right choice. Even though in reality there may also be an upper
limit to what loads are conceivable, it is not necessary or practical to try to 
limit the outcomes correspondingly. 




In a fourth experiment, we find on our doormat three envelopes, sent to us by 
three different persons, and we look in which order the envelopes lie on top of 
each other. Coding them 1, 2, and 3, the sample space would be 


\[ \Omega = \{123, 132, 213, 231, 312, 321\}. \]


QUICK EXERCISE 2.1 If we received mail from four different persons, how 
many elements would the corresponding sample space have? 


In general one might consider the order in which n different objects can be 
placed. This is called a permutation of the n objects. As we have seen, there 
are 6 possible permutations of 3 objects, and $4 \cdot 6 = 24$ of 4 objects. What
happens is that if we add the nth object, then this can be placed in any of n
positions in any of the permutations of $n - 1$ objects. Therefore there are


\[ n \cdot (n-1) \cdots 3 \cdot 2 \cdot 1 = n! \]


possible permutations of n objects. Here n! is the standard notation for this 
product and is pronounced “n factorial.” It is convenient to define 0! = 1. 
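
The counting argument can be checked by listing the permutations explicitly. The sketch below uses Python's standard library; the string coding of the envelopes follows the sample space above, and the range of n that is checked is an arbitrary choice.

```python
from itertools import permutations
from math import factorial

# All orderings of the three envelopes, coded as strings such as "213".
orders = ["".join(p) for p in permutations("123")]
print(orders)

# The number of permutations of n objects equals n! (including 0! = 1).
for n in range(0, 7):
    count = sum(1 for _ in permutations(range(n)))
    print(n, count, factorial(n))
```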


2.2 Events 


Subsets of the sample space are called events. We say that an event A occurs 
if the outcome of the experiment is an element of the set A. For example, in 
the birthday experiment we can ask for the outcomes that correspond to a 
long month, i.e., a month with 31 days. This is the event 


L = {Jan, Mar, May, Jul, Aug, Oct, Dec}. 


Events may be combined according to the usual set operations. 


For example if R is the event that corresponds to the months that have the 
letter r in their (full) name (so R = {Jan, Feb, Mar, Apr, Sep, Oct, Nov, Dec}), 
then the long months that contain the letter r are 


\[ L \cap R = \{\text{Jan}, \text{Mar}, \text{Oct}, \text{Dec}\}. \]


The set $L \cap R$ is called the intersection of L and R and occurs if both L and R
occur. Similarly, we have the union $A \cup B$ of two sets A and B, which occurs if
at least one of the events A and B occurs. Another common operation is taking
complements. The event $A^c = \{\omega \in \Omega : \omega \notin A\}$ is called the complement of A;
it occurs if and only if A does not occur. The complement of $\Omega$ is denoted
$\emptyset$, the empty set, which represents the impossible event. Figure 2.1 illustrates
these three set operations.
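
These operations correspond directly to the set operations of most programming languages. As an illustration, here is a short sketch with the birthday events L and R; the variable names are ours.

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
omega = set(months)

L = {"Jan", "Mar", "May", "Jul", "Aug", "Oct", "Dec"}          # long months
R = {"Jan", "Feb", "Mar", "Apr", "Sep", "Oct", "Nov", "Dec"}   # months with an r

def in_order(event):
    """List the months of an event in calendar order."""
    return sorted(event, key=months.index)

print("intersection:   ", in_order(L & R))
print("union:          ", in_order(L | R))
print("complement of L:", in_order(omega - L))
print("complement of the sample space:", omega - omega)
```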


[Figure 2.1, three panels: intersection $A \cap B$, union $A \cup B$, complement $A^c$.]


Fig. 2.1. Diagrams of intersection, union, and complement. 


We call events A and B disjoint or mutually exclusive if A and B have no 
outcomes in common; in set terminology: $A \cap B = \emptyset$. For example, the event L
“the birthday falls in a long month” and the event {Feb} are disjoint. 


Finally, we say that event A implies event B if the outcomes of A also lie 
in B. In set notation: $A \subset B$; see Figure 2.2.


Some people like to use double negations: 
“It is certainly not true that neither John nor Mary is to blame.”


This is equivalent to: “John or Mary is to blame, or both.” The following 
useful rules formalize this mental operation to a manipulation with events. 


DEMORGAN’S LAWS. For any two events A and B we have 


\[ (A \cup B)^c = A^c \cap B^c \quad\text{and}\quad (A \cap B)^c = A^c \cup B^c. \]
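
On a small sample space both laws can be verified by brute force. The sketch below checks them for every pair of events in a hypothetical four-element sample space; this is a check on one example, not a proof.

```python
from itertools import combinations

omega = frozenset(range(4))   # a hypothetical sample space with four outcomes

# All 16 events (subsets) of the sample space.
events = [frozenset(c)
          for r in range(len(omega) + 1)
          for c in combinations(sorted(omega), r)]

def complement(a):
    """Complement of the event a within the sample space."""
    return omega - a

holds = all(
    complement(a | b) == complement(a) & complement(b)
    and complement(a & b) == complement(a) | complement(b)
    for a in events
    for b in events
)
print(len(events), holds)
```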


QUICK EXERCISE 2.2 Let J be the event “John is to blame” and M the event 
“Mary is to blame.” Express the two statements above in terms of the events 
$J$, $J^c$, $M$, and $M^c$, and check the equivalence of the statements by means of
DeMorgan’s laws. 


[Figure 2.2, two panels: disjoint sets A and B; A a subset of B.]


Fig. 2.2. Minimal and maximal intersection of two sets. 




2.3 Probability 


We want to express how likely it is that an event occurs. To do this we will 
assign a probability to each event. The assignment of probabilities to events is 
in general not an easy task, and some of the coming chapters will be dedicated 
directly or indirectly to this problem. Since each event has to be assigned a 
probability, we speak of a probability function. It has to satisfy two basic 
properties. 


DEFINITION. A probability function P on a finite sample space $\Omega$
assigns to each event A in $\Omega$ a number P(A) in [0,1] such that

(i) $P(\Omega) = 1$, and

(ii) $P(A \cup B) = P(A) + P(B)$ if A and B are disjoint.

The number P(A) is called the probability that A occurs. 


Property (i) expresses that the outcome of the experiment is always an element 
of the sample space, and property (ii) is the additivity property of a probability 
function. It implies additivity of the probability function over more than two 
sets; e.g., if A, B, and C are disjoint events, then the two events $A \cup B$ and
C are also disjoint, so


\[ P(A \cup B \cup C) = P(A \cup B) + P(C) = P(A) + P(B) + P(C). \]


We will now look at some examples. When we want to decide whether Peter 
or Paul has to wash the dishes, we might toss a coin. The fact that we consider 
this a fair way to decide translates into the opinion that heads and tails are 
equally likely to occur as the outcome of the coin-tossing experiment. So we 
put 


\[ P(\{H\}) = P(\{T\}) = \frac{1}{2}. \]


Formally we have to write {H} for the set consisting of the single element H, 
because a probability function is defined on events, not on outcomes. From 
now on we shall drop these brackets. 


Now it might happen, for example due to an asymmetric distribution of the 
mass over the coin, that the coin is not completely fair. For example, it might 
be the case that 

\[ P(H) = 0.4999 \quad\text{and}\quad P(T) = 0.5001. \]


More generally we can consider experiments with two possible outcomes, say 
“failure” and “success”, which have probabilities 1 — p and p to occur, where 
p is a number between 0 and 1. For example, when our experiment consists 
of buying a ticket in a lottery with 10000 tickets and only one prize, where 
“success” stands for winning the prize, then $p = 10^{-4}$.

How should we assign probabilities in the second experiment, where we ask 
for the month in which the next person we meet has his or her birthday? In 
analogy with what we have just done, we put 


\[ P(\text{Jan}) = P(\text{Feb}) = \cdots = P(\text{Dec}) = \frac{1}{12}. \]

Some of you might object to this and propose that we put, for example,

\[ P(\text{Jan}) = \frac{31}{365} \quad\text{and}\quad P(\text{Apr}) = \frac{30}{365}, \]


because we have long months and short months. But then the very precise 
among us might remark that this does not yet take care of leap years. 


QUICK EXERCISE 2.3 If you would take care of the leap years, assuming that 
one in every four years is a leap year (which again is an approximation to 
reality!), how would you assign a probability to each month? 


In the third experiment (the buckling load of a bridge), where the outcomes are 
real numbers, it is impossible to assign a positive probability to each outcome 
(there are just too many outcomes!). We shall come back to this problem in 
Chapter 5, restricting ourselves in this chapter to finite and countably infinite¹
sample spaces. 


In the fourth experiment it makes sense to assign equal probabilities to all six 
outcomes: 


\[ P(123) = P(132) = P(213) = P(231) = P(312) = P(321) = \frac{1}{6}. \]

Until now we have only assigned probabilities to the individual outcomes of the 
experiments. To assign probabilities to events we use the additivity property. 
For instance, to find the probability P(T) of the event T that in the three 
envelopes experiment envelope 2 is on top we note that 
\[ P(T) = P(213) + P(231) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}. \]
In general, additivity of P implies that the probability of an event is obtained 
by summing the probabilities of the outcomes belonging to the event. 
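
This summing rule translates directly into code. The sketch below represents the probability function of the envelope experiment as a table of outcome probabilities and computes the probability of an event by adding them up; exact fractions are used to avoid rounding, and the names are ours.

```python
from fractions import Fraction

outcomes = ["123", "132", "213", "231", "312", "321"]
prob = {w: Fraction(1, 6) for w in outcomes}   # equal probabilities, as in the text

def P(event):
    """Probability of an event: the sum of the probabilities of its outcomes."""
    return sum((prob[w] for w in event), Fraction(0))

# The event T that envelope 2 is on top (first position in the coding).
T = {w for w in outcomes if w[0] == "2"}

print(sorted(T), P(T))
print(P(outcomes))   # probability of the whole sample space
```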


QUICK EXERCISE 2.4 Compute P(L) and P(R) in the birthday experiment.


Finally we mention a rule that permits us to compute probabilities of events 
A and B that are not disjoint. Note that we can write $A = (A \cap B) \cup (A \cap B^c)$,
which is a disjoint union; hence


\[ P(A) = P(A \cap B) + P(A \cap B^c). \]


If we split $A \cup B$ in the same way with B and $B^c$, we obtain the events
$(A \cup B) \cap B$, which is simply B, and $(A \cup B) \cap B^c$, which is nothing but $A \cap B^c$.


¹ This means: although infinite, we can still count them one by one;
$\Omega = \{\omega_1, \omega_2, \ldots\}$. The interval [0,1] of real numbers is an example of an
uncountable sample space.




Thus

\[ P(A \cup B) = P(B) + P(A \cap B^c). \]


Eliminating $P(A \cap B^c)$ from these two equations we obtain the following rule.


THE PROBABILITY OF A UNION. For any two events A and B we have

\[ P(A \cup B) = P(A) + P(B) - P(A \cap B). \]


From the additivity property we can also find a way to compute probabilities
of complements of events: from $A \cup A^c = \Omega$, we deduce that


\[ P(A^c) = 1 - P(A). \]
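
Both rules can be checked on the birthday experiment. The sketch below assumes the first assignment discussed above, in which every month has probability 1/12, and uses the events L and R from Section 2.2.

```python
from fractions import Fraction

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
L = {"Jan", "Mar", "May", "Jul", "Aug", "Oct", "Dec"}
R = {"Jan", "Feb", "Mar", "Apr", "Sep", "Oct", "Nov", "Dec"}

def P(event):
    """Probability of an event when all 12 months are equally likely."""
    return Fraction(len(event), len(months))

# Probability of a union: direct count versus the rule.
lhs = P(L | R)
rhs = P(L) + P(R) - P(L & R)
print(lhs, rhs)

# Probability of a complement: direct count versus the rule.
print(P(set(months) - L), 1 - P(L))
```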


2.4 Products of sample spaces 


Basic to statistics is that one usually does not consider one experiment, but 
that the same experiment is performed several times. For example, suppose 
we throw a coin two times. What is the sample space associated with this new 
experiment? It is clear that it should be the set 


\[ \Omega = \{H,T\} \times \{H,T\} = \{(H,H), (H,T), (T,H), (T,T)\}. \]


If in the original experiment we had a fair coin, i.e., P(H) = P(T), then in
this new experiment all 4 outcomes again have equal probabilities:

\[ P((H,H)) = P((H,T)) = P((T,H)) = P((T,T)) = \frac{1}{4}. \]

Somewhat more generally, if we consider two experiments with sample spaces
$\Omega_1$ and $\Omega_2$ then the combined experiment has as its sample space the set


\[ \Omega = \Omega_1 \times \Omega_2 = \{(\omega_1, \omega_2) : \omega_1 \in \Omega_1, \omega_2 \in \Omega_2\}. \]


If $\Omega_1$ has r elements and $\Omega_2$ has s elements, then $\Omega_1 \times \Omega_2$ has rs elements.
Now suppose that in the first, the second, and the combined experiment all
outcomes are equally likely to occur. Then the outcomes in the first experiment
have probability 1/r to occur, those of the second experiment 1/s, and
those of the combined experiment probability 1/rs. Motivated by the fact that
$1/rs = (1/r) \times (1/s)$, we will assign probability $p_i p_j$ to the outcome $(\omega_i, \omega_j)$
in the combined experiment, in the case that $\omega_i$ has probability $p_i$ and $\omega_j$ has
probability $p_j$ to occur. One should realize that this is by no means the only
way to assign probabilities to the outcomes of a combined experiment. The 
preceding choice corresponds to the situation where the two experiments do 
not influence each other in any way. What we mean by this influence will be 
explained in more detail in the next chapter. 
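
The product assignment is easily built by machine. The sketch below combines a fair coin with a fair six-sided die; the die is a hypothetical second experiment chosen only for illustration, and exact fractions are used for the probabilities.

```python
from fractions import Fraction
from itertools import product

coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
die = {face: Fraction(1, 6) for face in range(1, 7)}   # hypothetical second experiment

# Sample space of the combined experiment with product probabilities p_i * p_j.
joint = {(w1, w2): p1 * p2
         for (w1, p1), (w2, p2) in product(coin.items(), die.items())}

print(len(joint))            # r * s outcomes
print(joint[("H", 3)])       # probability of the outcome (H, 3)
print(sum(joint.values()))   # total probability of the combined sample space
```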




QUICK EXERCISE 2.5 Consider the sample space $\{a_1, a_2, a_3, a_4, a_5, a_6\}$ of some
experiment, where outcome $a_i$ has probability $p_i$ for $i = 1, \ldots, 6$. We perform
this experiment twice in such a way that the associated probabilities are


\[ P((a_i, a_i)) = p_i, \quad\text{and}\quad P((a_i, a_j)) = 0 \ \text{ if } i \neq j, \quad\text{for } i, j = 1, \ldots, 6. \]


Check that P is a probability function on the sample space
$\Omega = \{a_1, \ldots, a_6\} \times \{a_1, \ldots, a_6\}$ of the combined experiment. What is the relationship between
the first experiment and the second experiment that is determined by this 
probability function? 


We started this section with the experiment of throwing a coin twice. If we 
want to learn more about the randomness associated with a particular exper- 
iment, then we should repeat it more often, say n times. For example, if we 
perform an experiment with outcomes 1 (success) and 0 (failure) five times, 
and we consider the event A “exactly one experiment was a success,” then 
this event is given by the set 


A= {(0,0,0,0, 1), (0,0, 0,1, 0), (0,0, 1, 0,0), (0, 1,0, 0,0), (1,0, 0,0, 0)} 


in $\Omega = \{0,1\} \times \{0,1\} \times \{0,1\} \times \{0,1\} \times \{0,1\}$. Moreover, if success has
probability p and failure probability $1 - p$, then


\[ P(A) = 5 \cdot (1-p)^4 \cdot p, \]

since there are five outcomes in the event A, each having probability $(1-p)^4 \cdot p$.
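
The same result is obtained by enumerating all 32 outcomes of the five experiments and adding the probabilities of those in A. In the sketch below the function name and the value p = 0.3 are illustrative choices.

```python
from itertools import product

def prob_exactly(k, n, p):
    """Sum the product probabilities of all 0/1 outcomes with exactly k successes."""
    total = 0.0
    for outcome in product((0, 1), repeat=n):
        successes = sum(outcome)
        if successes == k:
            total += p ** successes * (1 - p) ** (n - successes)
    return total

p = 0.3  # hypothetical success probability
print(prob_exactly(1, 5, p), 5 * (1 - p) ** 4 * p)
```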


QUICK EXERCISE 2.6 What is the probability of the event B “exactly two 
experiments were successful” ? 


In general, when we perform an experiment n times, then the corresponding 
sample space is 
\[ \Omega = \Omega_1 \times \Omega_2 \times \cdots \times \Omega_n, \]


where $\Omega_i$ for $i = 1, \ldots, n$ is a copy of the sample space of the original experiment.
Moreover, we assign probabilities to the outcomes $(\omega_1, \ldots, \omega_n)$ in the
standard way described earlier, i.e.,


\[ P((\omega_1, \omega_2, \ldots, \omega_n)) = p_1 \cdot p_2 \cdots p_n, \]


if each $\omega_i$ has probability $p_i$.


2.5 An infinite sample space 


We end this chapter with an example of an experiment with infinitely many 
outcomes. We toss a coin repeatedly until the first head turns up. The outcome
of the experiment is the number of tosses it takes to have this first occurrence
of a head. Our sample space is the space of all positive natural numbers 


\[ \Omega = \{1, 2, 3, \ldots\}. \]


What is the probability function P for this experiment? 

Suppose the coin has probability p of falling on heads and probability $1 - p$ to
fall on tails, where $0 < p < 1$. We determine the probability P(n) for each n.
Clearly P(1) = p, the probability that we have a head right away. The event
{2} corresponds to the outcome (T, H) in $\{H,T\} \times \{H,T\}$, so we should have


\[ P(2) = (1-p)p. \]


Similarly, the event {n} corresponds to the outcome $(T, T, \ldots, T, T, H)$ in the
space $\{H,T\} \times \cdots \times \{H,T\}$. Hence we should have, in general,


\[ P(n) = (1-p)^{n-1} p, \quad n = 1, 2, 3, \ldots. \]


Does this define a probability function on $\Omega = \{1, 2, 3, \ldots\}$? Then we should
at least have $P(\Omega) = 1$. It is not directly clear how to calculate $P(\Omega)$: since
the sample space is no longer finite we have to amend the definition of a
probability function.


DEFINITION. A probability function on an infinite (or finite) sample
space $\Omega$ assigns to each event A in $\Omega$ a number P(A) in [0,1] such
that
(i) $P(\Omega) = 1$, and
(ii) $P(A_1 \cup A_2 \cup A_3 \cup \cdots) = P(A_1) + P(A_2) + P(A_3) + \cdots$

if $A_1, A_2, A_3, \ldots$ are disjoint events.


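Note that this new additivity property is an extension of the previous one
because if we choose $A_3 = A_4 = \cdots = \emptyset$, then

\[ \begin{aligned}
P(A_1 \cup A_2) &= P(A_1 \cup A_2 \cup \emptyset \cup \emptyset \cup \cdots) \\
&= P(A_1) + P(A_2) + 0 + 0 + \cdots = P(A_1) + P(A_2).
\end{aligned} \]

Now we can compute the probability of $\Omega$:

\[ \begin{aligned}
P(\Omega) &= P(1) + P(2) + \cdots + P(n) + \cdots \\
&= p + (1 - p)p + \cdots + (1 - p)^{n-1} p + \cdots \\
&= p \left[ 1 + (1 - p) + \cdots + (1 - p)^{n-1} + \cdots \right].
\end{aligned} \]

The sum $1 + (1 - p) + \cdots + (1 - p)^{n-1} + \cdots$ is an example of a geometric
series. It is well known that when $|1 - p| < 1$,

\[ 1 + (1 - p) + \cdots + (1 - p)^{n-1} + \cdots = \frac{1}{1 - (1 - p)} = \frac{1}{p}. \]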


Therefore we do indeed have $P(\Omega) = p \cdot \frac{1}{p} = 1$.
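The convergence of the series can also be checked numerically. The following is a minimal sketch, not part of the original text: the function names and the value p = 0.3 are illustrative assumptions. It adds up the probabilities P(1), ..., P(N) and shows the partial sums approaching 1.

```python
def first_head_probability(n: int, p: float) -> float:
    """P(n) = (1 - p)^(n - 1) * p: the first head appears on toss n."""
    return (1 - p) ** (n - 1) * p


def partial_sum(n_terms: int, p: float) -> float:
    """P(1) + P(2) + ... + P(n_terms)."""
    return sum(first_head_probability(n, p) for n in range(1, n_terms + 1))


p = 0.3  # illustrative value, any 0 < p < 1 works
for n_terms in (1, 5, 20, 100):
    print(n_terms, partial_sum(n_terms, p))
```

Each partial sum equals 1 - (1 - p)^N, so the gap to 1 shrinks geometrically as N grows.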


QUICK EXERCISE 2.7 Suppose an experiment in a laboratory is repeated every 
day of the week until it is successful, the probability of success being p. The 
first experiment is started on a Monday. What is the probability that the 
series ends on the next Sunday? 


2.6 Solutions to the quick exercises 


2.1 The sample space is $\Omega = \{1234, 1243, 1324, 1342, \ldots, 4321\}$. The best way
to count its elements is by noting that for each of the 6 outcomes of the
three-envelope experiment we can put a fourth envelope in any of 4 positions. Hence
$\Omega$ has $4 \cdot 6 = 24$ elements.


2.2 The statement “It is certainly not true that neither John nor Mary is to
blame” corresponds to the event $(J^c \cap M^c)^c$. The statement “John or Mary is
to blame, or both” corresponds to the event $J \cup M$. Equivalence now follows
from DeMorgan’s laws.


2.3 In four years we have $365 \times 3 + 366 = 1461$ days. Hence long months each
have a probability $4 \times 31/1461 = 124/1461$, and short months a probability
$120/1461$ to occur. Moreover, {Feb} has probability $113/1461$.


2.4 Since there are 7 long months and 8 months with an “r” in their name, 
we have P(L) = 7/12 and P(R) = 8/12. 


2.5 Checking that $P$ is a probability function on $\Omega$ amounts to verifying that
$0 \le P((a_i, a_j)) \le 1$ for all $i$ and $j$ and noting that


\[ P(\Omega) = \sum_{i,j=1}^{6} P((a_i, a_j)) = \sum_{i=1}^{6} P((a_i, a_i)) = \sum_{i=1}^{6} p_i = 1. \]


The two experiments are totally coupled: one has outcome $a_i$ if and only if
the other has outcome $a_i$.


2.6 Now there are 10 outcomes in $B$ (for example (0,1,0,1,0)), each having
probability $(1 - p)^3 p^2$. Hence $P(B) = 10(1 - p)^3 p^2$.


2.7 This happens if and only if the experiment fails on Monday, ..., Saturday,
and is a success on Sunday. This has probability $p(1 - p)^6$ to happen.


2.7 Exercises 


2.1 Let $A$ and $B$ be two events in a sample space for which $P(A) = 2/3$,
$P(B) = 1/6$, and $P(A \cap B) = 1/9$. What is $P(A \cup B)$?


2.2 Let E and F be two events for which one knows that the probability that 
at least one of them occurs is 3/4. What is the probability that neither E nor 
F occurs? Hint: use one of DeMorgan’s laws: $E^c \cap F^c = (E \cup F)^c$.


2.3 Let $C$ and $D$ be two events for which one knows that $P(C) = 0.3$, $P(D) =
0.4$, and $P(C \cap D) = 0.2$. What is $P(C^c \cap D)$?


2.4 We consider events $A$, $B$, and $C$, which can occur in some experiment.
Is it true that the probability that only $A$ occurs (and not $B$ or $C$) is equal
to $P(A \cup B \cup C) - P(B) - P(C) + P(B \cap C)$?


2.5 The event $A \cap B^c$ that $A$ occurs but not $B$ is sometimes denoted as $A \setminus B$.
Here $\setminus$ is the set-theoretic minus sign. Show that $P(A \setminus B) = P(A) - P(B)$ if
$B$ implies $A$, i.e., if $B \subset A$.

2.6 When $P(A) = 1/3$, $P(B) = 1/2$, and $P(A \cup B) = 3/4$, what is


a. $P(A \cap B)$?
b. $P(A^c \cup B^c)$?


2.7 Let $A$ and $B$ be two events. Suppose that $P(A) = 0.4$, $P(B) = 0.5$, and
$P(A \cap B) = 0.1$. Find the probability that $A$ or $B$ occurs, but not both.


2.8 Suppose the events $D_1$ and $D_2$ represent disasters, which are rare:
$P(D_1) \le 10^{-6}$ and $P(D_2) \le 10^{-6}$. What can you say about the probability
that at least one of the disasters occurs? What about the probability that
they both occur?


2.9 We toss a coin three times. For this experiment we choose the sample 
space 


\[ \Omega = \{HHH, THH, HTH, HHT, TTH, THT, HTT, TTT\} \]
where T stands for tails and H for heads. 


a. Write down the set of outcomes corresponding to each of the following 
events: 


A: “we throw tails exactly two times.” 

B: “we throw tails at least two times.” 

C: “tails did not appear before a head appeared.” 
D: “the first throw results in tails.” 


b. Write down the set of outcomes corresponding to each of the following 
events: $A^c$, $A \cup (C \cap D)$, and $A \cap D^c$.


2.10 In some sample space we consider two events A and B. Let C be the 
event that $A$ or $B$ occurs, but not both. Express $C$ in terms of $A$ and $B$, using
only the basic operations “union,” “intersection,” and “complement.”


2.11 An experiment has only two outcomes. The first has probability $p$ to
occur, the second probability $p^2$. What is $p$?


2.12 In the UEFA Euro 2004 playoffs draw 10 national football teams
were matched in pairs. A lot of people complained that “the draw was not 
fair,” because each strong team had been matched with a weak team (this 
is commercially the most interesting). It was claimed that such a matching 
is extremely unlikely. We will compute the probability of this “dream draw” 
in this exercise. In the spirit of the three-envelope example of Section 2.1 
we put the names of the 5 strong teams in envelopes labeled 1,2,3,4, and 
5 and of the 5 weak teams in envelopes labeled 6,7,8,9, and 10. We shuffle 
the 10 envelopes and then match the envelope on top with the next envelope, 
the third envelope with the fourth envelope, and so on. One particular way 
a “dream draw” occurs is when the five envelopes labeled 1,2,3,4,5 are in 
the odd numbered positions (in any order!) and the others are in the even 
numbered positions. This way corresponds to the situation where the first 
match of each strong team is a home match. Since for each pair there are 
two possibilities for the home match, the total number of possibilities for the 
“dream draw” is $2^5 = 32$ times as large.


a. An outcome of this experiment is a sequence like 4, 9, 3, 7, 5, 10, 1, 8, 2, 6 of
labels of envelopes. What is the probability of an outcome? 


b. How many outcomes are there in the event “the five envelopes labeled 
1,2,3,4,5 are in the odd positions—in any order, and the envelopes la- 
beled 6,7,8,9,10 are in the even positions—in any order”? 


c. What is the probability of a “dream draw”? 


2.13 In some experiment first an arbitrary choice is made out of four pos- 
sibilities, and then an arbitrary choice is made out of the remaining three 
possibilities. One way to describe this is with a product of two sample spaces 
$\{a, b, c, d\}$:

\[ \Omega = \{a, b, c, d\} \times \{a, b, c, d\}. \]


a. Make a $4 \times 4$ table in which you write the probabilities of the outcomes.


b. Describe the event “c is one of the chosen possibilities” and determine its 
probability. 


2.14 Consider the Monty Hall “experiment” described in Section 1.3. The 
door behind which the car is parked we label a, the other two b and c. As the 
sample space we choose a product space 


\[ \Omega = \{a, b, c\} \times \{a, b, c\}. \]


Here the first entry gives the choice of the candidate, and the second entry 
the choice of the quizmaster. 


a. Make a $3 \times 3$ table in which you write the probabilities of the outcomes.
N.B. You should realize that the candidate does not know that the car
is in a, but the quizmaster will never open the door labeled a because he
knows that the car is there. You may assume that the quizmaster makes
an arbitrary choice between the doors labeled b and c, when the candidate
chooses door a.


b. Consider the situation of a “no switching” candidate who will stick to his 
or her choice. What is the event “the candidate wins the car,” and what 
is its probability? 

c. Consider the situation of a “switching” candidate who will not stick to 
her choice. What is now the event “the candidate wins the car,” and what 
is its probability? 


2.15 The rule $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ from Section 2.3 is often
useful to compute the probability of the union of two events. What would be
the corresponding rule for three events $A$, $B$, and $C$? It should start with


\[ P(A \cup B \cup C) = P(A) + P(B) + P(C) \]


Hint: you could use the sum rule suitably, or you could make a diagram as in 
Figure 2.1. 


2.16 Three events $E$, $F$, and $G$ cannot occur simultaneously. Further it
is known that $P(E \cap F) = P(F \cap G) = P(E \cap G) = 1/3$. Can you
determine $P(E)$?

Hint: if you try to use the formula of Exercise 2.15 then it seems that you do 
not have enough information; make a diagram instead. 


2.17 A post office has two counters where customers can buy stamps, etc. 
If you are interested in the number of customers in the two queues that will 
form for the counters, what would you take as sample space? 


2.18 In a laboratory, two experiments are repeated every day of the week in 
different rooms until at least one is successful, the probability of success be- 
ing p for each experiment. Supposing that the experiments in different rooms 
and on different days are performed independently of each other, what is the 
probability that the laboratory scores its first successful experiment on day n? 


2.19 We repeatedly toss a coin. A head has probability p, and a tail prob-
ability 1 — p to occur, where 0 < p < 1. The outcome of the experiment we 
are interested in is the number of tosses it takes until a head occurs for the 
second time. 


a. What would you choose as the sample space? 
b. What is the probability that it takes 5 tosses? 


3 


Conditional probability and independence 


Knowing that an event has occurred sometimes forces us to reassess the prob- 
ability of another event; the new probability is the conditional probability. If 
the conditional probability equals what the probability was before, the events 
involved are called independent. Often, conditional probabilities and indepen- 
dence are needed if we want to compute probabilities, and in many other 
situations they simplify the work. 


3.1 Conditional probability 


In the previous chapter we encountered the events L, “born in a long month,” 
and R, “born in a month with the letter r.” Their probabilities are easy to 
compute: since L = {Jan, Mar, May, Jul, Aug, Oct, Dec} and R = {Jan, Feb, 
Mar, Apr, Sep, Oct, Nov, Dec}, one finds 


\[ P(L) = \frac{7}{12} \quad \text{and} \quad P(R) = \frac{8}{12}. \]


Now suppose that it is known about the person we meet in the street that 
he was born in a “long month,” and we wonder whether he was born in 
a “month with the letter r.” The information given excludes five outcomes 
of our sample space: it cannot be February, April, June, September, or 
November. Seven possible outcomes are left, of which only four (those in
$R \cap L = \{\text{Jan, Mar, Oct, Dec}\}$) are favorable, so we reassess the probability
as 4/7. We call this the conditional probability of $R$ given $L$, and we write:


\[ P(R \mid L) = \frac{4}{7}. \]


This is not the same as $P(R \cap L)$, which is 1/3. Also note that $P(R \mid L)$ is the
proportion that $P(R \cap L)$ is of $P(L)$.


QUICK EXERCISE 3.1 Let $N = R^c$ be the event “born in a month without r.”
What is the conditional probability P(N | L)? 


Recalling the three envelopes on our doormat, consider the events “envelope 1 
is the middle one” (call this event A) and “envelope 2 is the middle one” (B). 
Then P(A) = P(213 or 312) = 1/3; by symmetry, the same is found for P(B). 
We say that the envelopes are in order if their order is either 123 or 321. 
Suppose we know that they are not in order, but otherwise we do not know 
anything; what are the probabilities of A and B, given this information? 


Let $C$ be the event that the envelopes are not in order, so: $C = \{123, 321\}^c =
\{132, 213, 231, 312\}$. We ask for the probabilities of $A$ and $B$, given that $C$
occurs. Event $C$ consists of four elements, two of which also belong to $A$:
$A \cap C = \{213, 312\}$, so $P(A \mid C) = 1/2$. The probability of $A \cap C$ is half of
$P(C)$. No element of $C$ also belongs to $B$, so $P(B \mid C) = 0$.
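The counting argument above can be mirrored by enumerating the six equally likely orders of the envelopes. This is a minimal sketch, not part of the original text: the helper names `prob` and `cond_prob` are illustrative assumptions, and outcomes are encoded as strings such as "213".

```python
from fractions import Fraction
from itertools import permutations

# All 6 equally likely orders of the three envelopes, e.g. "213".
outcomes = ["".join(perm) for perm in permutations("123")]


def prob(event):
    """Probability of an event (a set of outcomes) by counting."""
    return Fraction(len(event), len(outcomes))


def cond_prob(event, given):
    """Fraction of the probability of `given` that also lies in `event`."""
    return prob(event & given) / prob(given)


A = {w for w in outcomes if w[1] == "1"}   # envelope 1 is the middle one
B = {w for w in outcomes if w[1] == "2"}   # envelope 2 is the middle one
C = set(outcomes) - {"123", "321"}         # envelopes are not in order

print(prob(A), cond_prob(A, C), cond_prob(B, C))
```

Using `Fraction` keeps the results exact, so they can be compared directly with the values 1/3, 1/2, and 0 found in the text.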


QUICK EXERCISE 3.2 Calculate $P(C \mid A)$ and $P(C^c \mid A \cup B)$.


In general, computing the probability of an event A, given that an event C 
occurs, means finding which fraction of the probability of C is also in the 
event A. 


DEFINITION. The conditional probability of $A$ given $C$ is given by:


\[ P(A \mid C) = \frac{P(A \cap C)}{P(C)}, \]


provided $P(C) > 0$.


QUICK EXERCISE 3.3 Show that $P(A \mid C) + P(A^c \mid C) = 1$.


This exercise shows that the rule $P(A^c) = 1 - P(A)$ also holds for conditional
probabilities. In fact, even more is true: if we have a fixed conditioning event $C$
and define $Q(A) = P(A \mid C)$ for events $A \subset \Omega$, then $Q$ is a probability function
and hence satisfies all the rules as described in Chapter 2. The definition of 
conditional probability agrees with our intuition and it also works in situations 
where computing probabilities by counting outcomes does not. 


A chemical reactor: residence times 


Consider a continuously stirred reactor vessel where a chemical reaction takes 
place. On one side fluid or gas flows in, mixes with whatever is already present 
in the vessel, and eventually flows out on the other side. One characteristic 
of each particular reaction setup is the so-called residence time distribution, 
which tells us how long particles stay inside the vessel before moving on. We 
consider a continuously stirred tank: the contents of the vessel are perfectly 
mixed at all times. 


Let $R_t$ denote the event “the particle has a residence time longer than $t$
seconds.” In Section 5.3 we will see how continuous stirring determines the
probabilities; here we just use that in a particular continuously stirred tank,
$R_t$ has probability $e^{-t}$. So:


\[ P(R_3) = e^{-3} = 0.04978\ldots \]
\[ P(R_4) = e^{-4} = 0.01831\ldots. \]


We can use the definition of conditional probability to find the probability
that a particle that has stayed more than 3 seconds will stay more than 4:


\[ P(R_4 \mid R_3) = \frac{P(R_4 \cap R_3)}{P(R_3)} = \frac{P(R_4)}{P(R_3)} = \frac{e^{-4}}{e^{-3}} = e^{-1} = 0.36787\ldots. \]
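The arithmetic above is easy to reproduce. The sketch below is illustrative only: the function name `prob_longer_than` is an assumption, and it simply encodes the probability $e^{-t}$ that the text gives for this particular tank.

```python
import math


def prob_longer_than(t: float) -> float:
    """P(R_t) = e^(-t) for the particular stirred tank in the text."""
    return math.exp(-t)


# R_4 is contained in R_3, so the intersection of R_4 and R_3 is R_4 itself.
p_r4_given_r3 = prob_longer_than(4) / prob_longer_than(3)
print(prob_longer_than(3), prob_longer_than(4), p_r4_given_r3)
```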


QUICK EXERCISE 3.4 Calculate $P(R_3 \mid R_4^c)$.


For more details on the subject of residence time distributions see, for example, 
the book on reaction engineering by Fogler ([11]). 


3.2 The multiplication rule 


From the definition of conditional probability we derive a useful rule by mul- 
tiplying left and right by P(C). 


THE MULTIPLICATION RULE. For any events A and C: 


\[ P(A \cap C) = P(A \mid C) \cdot P(C). \]


Computing the probability of $A \cap C$ can hence be decomposed into two parts,
computing $P(C)$ and $P(A \mid C)$ separately, which is often easier than computing
$P(A \cap C)$ directly.


The probability of no coincident birthdays 


Suppose you meet two arbitrarily chosen people. What is the probability their 
birthdays are different? Let $B_2$ denote the event that this happens. Whatever
the birthday of the first person is, there is only one day the second person
cannot “pick” as birthday, so:

\[ P(B_2) = 1 - \frac{1}{365}. \]

When the same question is asked with three people, conditional probabilities
become helpful. The event $B_3$ can be seen as the intersection of the event $B_2$,
“the first two have different birthdays,” with event $A_3$ “the third person has
a birthday that does not coincide with that of one of the first two persons.”
Using the multiplication rule: 


\[ P(B_3) = P(A_3 \cap B_2) = P(A_3 \mid B_2) P(B_2). \]


The conditional probability $P(A_3 \mid B_2)$ is the probability that, when two days
are already marked on the calendar, a day picked at random is not marked,
or

\[ P(A_3 \mid B_2) = 1 - \frac{2}{365}, \]

and so

\[ P(B_3) = P(A_3 \mid B_2) P(B_2) = \left(1 - \frac{2}{365}\right) \left(1 - \frac{1}{365}\right) = 0.9918. \]

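We are already halfway to solving the general question: in a group of $n$
arbitrarily chosen people, what is the probability there are no coincident
birthdays? The event $B_n$ of no coincident birthdays among the $n$ persons is the
same as: “the birthdays of the first $n - 1$ persons are different” (the event
$B_{n-1}$) and “the birthday of the $n$th person does not coincide with a birthday
of any of the first $n - 1$ persons” (the event $A_n$), that is,


\[ B_n = A_n \cap B_{n-1}. \]


Applying the multiplication rule yields:


\[ P(B_n) = P(A_n \mid B_{n-1}) \cdot P(B_{n-1}) = \left(1 - \frac{n-1}{365}\right) \cdot P(B_{n-1}) \]


as person $n$ should avoid $n - 1$ days. Applying the same step to $P(B_{n-1})$,
$P(B_{n-2})$, etc., we find:


\[ \begin{aligned}
P(B_n) &= \left(1 - \frac{n-1}{365}\right) \cdot P(A_{n-1} \mid B_{n-2}) \cdot P(B_{n-2}) \\
&= \left(1 - \frac{n-1}{365}\right) \left(1 - \frac{n-2}{365}\right) \cdots \left(1 - \frac{2}{365}\right) \cdot P(B_2) \\
&= \left(1 - \frac{n-1}{365}\right) \left(1 - \frac{n-2}{365}\right) \cdots \left(1 - \frac{1}{365}\right).
\end{aligned} \]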

This can be used to compute the probability for arbitrary $n$. For example,
we find: $P(B_{22}) = 0.5243$ and $P(B_{23}) = 0.4927$. In Figure 3.1 the probability
$P(B_n)$ is plotted for $n = 1, \ldots, 100$, with dotted lines drawn at $n = 23$ and
at probability 0.5. It may be hard to believe, but with just 23 people the
probability of all birthdays being different is less than 50%!


Fig. 3.1. The probability $P(B_n)$ of no coincident birthdays for $n = 1, \ldots, 100$.
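The repeated use of the multiplication rule translates directly into a loop. The sketch below is one possible way to compute the probabilities quoted above; the function name and the `days` parameter are illustrative assumptions, not notation from the text.

```python
def prob_no_coincident_birthdays(n: int, days: int = 365) -> float:
    """P(B_n): n arbitrary people all have different birthdays."""
    prob = 1.0  # P(B_1) = 1: a single person cannot coincide with anyone
    for k in range(2, n + 1):
        # Multiplication rule: person k has to avoid k - 1 marked days.
        prob *= 1 - (k - 1) / days
    return prob


for n in (2, 3, 22, 23):
    print(n, round(prob_no_coincident_birthdays(n), 4))
```

Evaluating the function for n = 22 and n = 23 shows where the probability drops below one half.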


QUICK EXERCISE 3.5 Compute the probability that three arbitrary people are
born in different months. Can you give the formula for $n$ people?


It matters how one conditions


Conditioning can help to make computations easier, but it matters how it is 
applied. To compute $P(A \cap C)$ we may condition on $C$ to get


\[ P(A \cap C) = P(A \mid C) \cdot P(C); \]

or we may condition on $A$ and get

\[ P(A \cap C) = P(C \mid A) \cdot P(A). \]


Both ways are valid, but often one of P(A|C) and P(C| A) is easy and the 
other is not. For example, in the birthday example one could have tried: 


\[ P(B_3) = P(A_3 \cap B_2) = P(B_2 \mid A_3) P(A_3), \]


but just trying to understand the conditional probability P(B2 | A3) already 
is confusing: 


The probability that the first two persons’ birthdays differ given that 
the third person’s birthday does not coincide with the birthday of one 
of the first two ...? 


Conditioning should lead to easier probabilities; if not, it is probably the 
wrong approach. 


3.3 The law of total probability and Bayes’ rule 


We will now discuss two important rules that help probability computations 
by means of conditional probabilities. We introduce both of them in the next 
example. 


Testing for mad cow disease 


In early 2001 the European Commission introduced massive testing of cattle 
to determine infection with the transmissible form of Bovine Spongiform En- 
cephalopathy (BSE) or “mad cow disease.” As no test is 100% accurate, most 
tests have the problem of false positives and false negatives. A false positive 
means that according to the test the cow is infected, but in actuality it is not. 
A false negative means an infected cow is not detected by the test. 


Imagine we test a cow. Let B denote the event “the cow has BSE” and T 
the event “the test comes up positive” (this is test jargon for: according to 
the test we should believe the cow is infected with BSE). One can “test the 
test” by analyzing samples from cows that are known to be infected or known 
to be healthy and so determine the effectiveness of the test. The European 
Commission had this done for four tests in 1999 (see [19]) and for several 
more later. The results for what the report calls Test A may be summarized 
as follows: an infected cow has a 70% chance of testing positive, and a healthy 
cow just 10%; in formulas: 


\[ P(T \mid B) = 0.70, \]
\[ P(T \mid B^c) = 0.10. \]


Suppose we want to determine the probability $P(T)$ that an arbitrary cow
tests positive. The tested cow is either infected or it is not: event $T$ occurs in
combination with $B$ or with $B^c$ (there are no other possibilities). In terms of
events

\[ T = (T \cap B) \cup (T \cap B^c), \]

so that

\[ P(T) = P(T \cap B) + P(T \cap B^c), \]

because $T \cap B$ and $T \cap B^c$ are disjoint. Next, apply the multiplication rule (in
such a way that the known conditional probabilities appear!):

\[ \begin{aligned}
P(T \cap B) &= P(T \mid B) \cdot P(B) \\
P(T \cap B^c) &= P(T \mid B^c) \cdot P(B^c)
\end{aligned} \tag{3.1} \]

so that

\[ P(T) = P(T \mid B) \cdot P(B) + P(T \mid B^c) \cdot P(B^c). \tag{3.2} \]


This is an application of the law of total probability: computing a probability
through conditioning on several disjoint events that make up the whole sample
space (in this case two). Suppose¹ $P(B) = 0.02$; then from the last equation
we conclude: $P(T) = 0.02 \cdot 0.70 + (1 - 0.02) \cdot 0.10 = 0.112$.
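Equation (3.2) can be written as a small function. This is a minimal sketch under the assumptions of the example; the function and parameter names are illustrative and do not come from the text.

```python
def prob_positive(p_b: float, p_t_given_b: float, p_t_given_not_b: float) -> float:
    """P(T) via (3.2): condition on B and on its complement."""
    return p_t_given_b * p_b + p_t_given_not_b * (1 - p_b)


# Test A with the assumed infection probability P(B) = 0.02.
p_t = prob_positive(p_b=0.02, p_t_given_b=0.70, p_t_given_not_b=0.10)
print(round(p_t, 3))
```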


QUICK EXERCISE 3.6 Calculate $P(T)$ when $P(T \mid B) = 0.99$ and $P(T \mid B^c) =
0.05$.


Following is a general statement of the law. 


THE LAW OF TOTAL PROBABILITY. Suppose $C_1, C_2, \ldots, C_m$ are
disjoint events such that $C_1 \cup C_2 \cup \cdots \cup C_m = \Omega$. The probability of
an arbitrary event $A$ can be expressed as:


\[ P(A) = P(A \mid C_1)P(C_1) + P(A \mid C_2)P(C_2) + \cdots + P(A \mid C_m)P(C_m). \]


Figure 3.2 illustrates the law for $m = 5$. The event $A$ is the disjoint union of
$A \cap C_i$, for $i = 1, \ldots, 5$, so $P(A) = P(A \cap C_1) + \cdots + P(A \cap C_5)$, and for each $i$
the multiplication rule states $P(A \cap C_i) = P(A \mid C_i) \cdot P(C_i)$.


Fig. 3.2. The law of total probability (illustration for m = 5). 


In the BSE example, we have just two mutually exclusive events: substitute
$m = 2$, $C_1 = B$, $C_2 = B^c$, and $A = T$ to obtain (3.2).

Another, perhaps more pertinent, question about the BSE test is the following:
suppose my cow tests positive; what is the probability it really has BSE?
Translated, this asks for the value of $P(B \mid T)$. The information we were given
is $P(T \mid B)$, a conditional probability, but the wrong one. We would like to
switch $T$ and $B$.


¹ We choose this probability for the sake of the calculations that follow. The true
value is unknown and varies from country to country. The BSE risk for the
Netherlands for 2003 was estimated to be $P(B) \approx 0.000013$.


Start with the definition of conditional probability and then use equations
(3.1) and (3.2):


\[ P(B \mid T) = \frac{P(T \cap B)}{P(T)} = \frac{P(T \mid B) \cdot P(B)}{P(T \mid B) \cdot P(B) + P(T \mid B^c) \cdot P(B^c)}. \]


So with $P(B) = 0.02$ we find


\[ P(B \mid T) = \frac{0.70 \cdot 0.02}{0.70 \cdot 0.02 + 0.10 \cdot (1 - 0.02)} = 0.125, \]


and by a similar calculation: $P(B \mid T^c) = 0.0068$. These probabilities reflect
that this Test A is not a very good test; a perfect test would result in
$P(B \mid T) = 1$ and $P(B \mid T^c) = 0$. In Exercise 3.4 we redo this calculation,
replacing $P(B) = 0.02$ with a more realistic number.


What we have just seen is known as Bayes’ rule, after the English clergyman 
Thomas Bayes who derived this in the 18th century. The general statement 
follows. 


BAYES’ RULE. Suppose the events $C_1, C_2, \ldots, C_m$ are disjoint and
$C_1 \cup C_2 \cup \cdots \cup C_m = \Omega$. The conditional probability of $C_i$, given an
arbitrary event $A$, can be expressed as:


\[ P(C_i \mid A) = \frac{P(A \mid C_i) \cdot P(C_i)}{P(A \mid C_1)P(C_1) + P(A \mid C_2)P(C_2) + \cdots + P(A \mid C_m)P(C_m)}. \]


This is the traditional form of Bayes’ formula. It follows from


\[ P(C_i \mid A) = \frac{P(A \mid C_i) \cdot P(C_i)}{P(A)} \tag{3.3} \]


in combination with the law of total probability applied to $P(A)$ in the
denominator. Purists would refer to (3.3) as Bayes’ rule, and perhaps they are
right.
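Bayes' rule for a partition into m events can be sketched in a few lines. The function `bayes` and its argument names are illustrative assumptions, not notation from the text; the numbers are those of Test A with the assumed value P(B) = 0.02.

```python
def bayes(priors, likelihoods):
    """Return the list of P(C_i | A) for disjoint C_1, ..., C_m that make up
    the sample space, given priors P(C_i) and likelihoods P(A | C_i)."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]  # P(A and C_i)
    p_a = sum(joint)  # law of total probability
    return [j / p_a for j in joint]


priors = [0.02, 0.98]  # P(B), P(B^c)
given_positive = bayes(priors, [0.70, 0.10])  # A = T (test positive)
given_negative = bayes(priors, [0.30, 0.90])  # A = T^c (test negative)
print(round(given_positive[0], 4), round(given_negative[0], 4))
```

The first entry of each result is the probability of infection given the test outcome, which can be compared with the values computed by hand above.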


QUICK EXERCISE 3.7 Calculate $P(B \mid T)$ and $P(B \mid T^c)$ if $P(T \mid B) = 0.99$ and
$P(T \mid B^c) = 0.05$.


3.4 Independence 


Consider three probabilities from the previous section: 
\[ P(B) = 0.02, \]
\[ P(B \mid T) = 0.125, \]
\[ P(B \mid T^c) = 0.0068. \]


If we know nothing about a cow, we would say that there is a 2% chance it is
infected. However, if we know it tested positive, we can say there is a 12.5%
chance the cow is infected. On the other hand, if it tested negative, there is
only a 0.68% chance. We see that the two events are related in some way: the
probability of $B$ depends on whether $T$ occurs.


Imagine the opposite: the test is useless. Whether the cow is infected is unre- 
lated to the outcome of the test, and knowing the outcome of the test does not 
change our probability of B: P(B|T) = P(B). In this case we would call B 
independent of T. 


DEFINITION. An event A is called independent of B if 


P(A| B) =P(A). 


From this simple definition many statements can be derived. For example, 
because $P(A^c \mid B) = 1 - P(A \mid B)$ and $1 - P(A) = P(A^c)$, we conclude:


\[ A \text{ independent of } B \iff A^c \text{ independent of } B. \tag{3.4} \]


By application of the multiplication rule, if $A$ is independent of $B$, then
$P(A \cap B) = P(A \mid B)P(B) = P(A)P(B)$. On the other hand, if $P(A \cap B) =
P(A)P(B)$, then $P(A \mid B) = P(A)$ follows from the definition of independence.
This shows:

\[ A \text{ independent of } B \iff P(A \cap B) = P(A)P(B). \]

Finally, by definition of conditional probability, if $A$ is independent of $B$, then

\[ P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A) \cdot P(B)}{P(A)} = P(B), \]

that is, $B$ is independent of $A$. This works in reverse, too, so we have:

\[ A \text{ independent of } B \iff B \text{ independent of } A. \tag{3.5} \]


This statement says that in fact, independence is a mutual property. Therefore,
the expressions “A is independent of B” and “A and B are independent” are
used interchangeably. From the three $\iff$-statements it follows that there are
in fact 12 ways to show that $A$ and $B$ are independent; and if they are, there
are 12 ways to use that.


INDEPENDENCE. To show that $A$ and $B$ are independent it suffices
to prove just one of the following:


\[ P(A \mid B) = P(A), \]
\[ P(B \mid A) = P(B), \]
\[ P(A \cap B) = P(A) P(B), \]

where $A$ may be replaced by $A^c$ and $B$ replaced by $B^c$, or both. If
one of these statements holds, all of them are true. If two events are
not independent, they are called dependent.


Recall the birthday events L “born in a long month” and R “born in a month 
with the letter r.” Let H be the event “born in the first half of the year,” 
so $P(H) = 1/2$. Also, $P(H \mid R) = 1/2$. So $H$ and $R$ are independent, and we
conclude, for example, $P(R^c \mid H^c) = P(R^c) = 1 - 8/12 = 1/3$.

We know that $P(L \cap H) = 1/4$ and $P(L) = 7/12$. Checking $1/2 \times 7/12 \neq 1/4$,
you conclude that $L$ and $H$ are dependent.
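These checks can be reproduced by counting months. The sketch below follows the text in treating the 12 months as equally likely; the helper names `prob` and `independent` are illustrative assumptions.

```python
from fractions import Fraction

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

L = {"Jan", "Mar", "May", "Jul", "Aug", "Oct", "Dec"}         # long months
R = {"Jan", "Feb", "Mar", "Apr", "Sep", "Oct", "Nov", "Dec"}  # months with r
H = set(months[:6])                                           # first half of the year


def prob(event):
    """Probability of a set of months, each month counted as 1/12."""
    return Fraction(len(event), len(months))


def independent(a, b):
    """Product criterion: P(a and b) = P(a) P(b)."""
    return prob(a & b) == prob(a) * prob(b)


print(independent(H, R), independent(L, H))
```

Exact fractions avoid rounding issues when testing the product criterion.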


QUICK EXERCISE 3.8 Derive the statement “$R^c$ is independent of $H^c$” from
“$H$ is independent of $R$” using rules (3.4) and (3.5).


Since the words dependence and independence have several meanings, one 
sometimes uses the terms stochastic or statistical dependence and indepen- 
dence to avoid ambiguity. 


Remark 3.1 (Physical and stochastic independence). Stochastic 
dependence or independence can sometimes be established by inspecting 
whether there is any physical dependence present. The following statements 
may be made. 

If events have to do with processes or experiments that have no physical con- 
nection, they are always stochastically independent. If they are connected 
to the same physical process, then, as a rule, they are stochastically de- 
pendent, but stochastic independence is possible in exceptional cases. The 
events H and R are an example. 


Independence of two or more events 


When more than two events are involved we need a more elaborate definition 
of independence. The reason behind this is explained by an example following 
the definition. 


INDEPENDENCE OF TWO OR MORE EVENTS. Events $A_1$, $A_2$, ...,
$A_m$ are called independent if


\[ P(A_1 \cap A_2 \cap \cdots \cap A_m) = P(A_1) P(A_2) \cdots P(A_m) \]


and this statement also holds when any number of the events $A_1$,
..., $A_m$ are replaced by their complements throughout the formula.


You see that we need to check $2^m$ equations to establish the independence of
$m$ events. In fact, $m + 1$ of those equations are redundant, but we chose this
version of the definition because it is easier.

The reason we need to do so much more checking to establish independence 
for multiple events is that there are subtle ways in which events may depend 
on each other. Consider the question: 


Is independence for three events A, B, and C the same as: A and B are 
independent; B and C are independent; and A and C are independent? 


The answer is “No,” as the following example shows. Perform two independent 
tosses of a coin. Let A be the event “heads on toss 1,” B the event “heads on 
toss 2,” and $C$ “the two tosses are equal.”


First, get the probabilities. Of course, $P(A) = P(B) = 1/2$, but also


\[ P(C) = P(A \cap B) + P(A^c \cap B^c) = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}. \]


What about independence? Events A and B are independent by assumption, 
so check the independence of A and C. Given that the first toss is heads (A 
occurs), C' occurs if and only if the second toss is heads as well (B occurs), so 


\[ P(C \mid A) = P(B \mid A) = P(B) = \frac{1}{2} = P(C). \]


By symmetry, also $P(C \mid B) = P(C)$, so all pairs taken from $A$, $B$, $C$ are
independent: the three are called pairwise independent. Checking the full
conditions for independence, we find, for example:


\[ P(A \cap B \cap C) = P(A \cap B) = \frac{1}{4}, \quad \text{whereas} \quad P(A) P(B) P(C) = \frac{1}{8}, \]

and

\[ P(A \cap B \cap C^c) = P(\emptyset) = 0, \quad \text{whereas} \quad P(A) P(B) P(C^c) = \frac{1}{8}. \]

The reason for this is clear: whether $C$ occurs follows deterministically from
the outcomes of tosses 1 and 2.
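The distinction between pairwise independence and independence of all three events can be verified by enumerating the four outcomes of two tosses. This is a minimal sketch that assumes a fair coin, as in the example; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Two independent tosses of a fair coin: four equally likely outcomes.
outcomes = list(product("HT", repeat=2))


def prob(event):
    return Fraction(len(event), len(outcomes))


A = {w for w in outcomes if w[0] == "H"}   # heads on toss 1
B = {w for w in outcomes if w[1] == "H"}   # heads on toss 2
C = {w for w in outcomes if w[0] == w[1]}  # the two tosses are equal

# Every pair satisfies the product criterion ...
pairwise = all(prob(X & Y) == prob(X) * prob(Y)
               for X, Y in [(A, B), (A, C), (B, C)])
# ... but the product formula for all three events fails.
triple = prob(A & B & C) == prob(A) * prob(B) * prob(C)
print(pairwise, triple, prob(A & B & C), prob(A) * prob(B) * prob(C))
```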


3.5 Solutions to the quick exercises 


3.1 $N$ = {May, Jun, Jul, Aug}, $L$ = {Jan, Mar, May, Jul, Aug, Oct, Dec},
and $N \cap L$ = {May, Jul, Aug}. Three out of seven outcomes of $L$ belong to
$N$ as well, so $P(N \mid L) = 3/7$.


3.2 The event $A$ is contained in $C$. So when $A$ occurs, $C$ also occurs; therefore
$P(C \mid A) = 1$.

Since $C^c = \{123, 321\}$ and $A \cup B = \{123, 321, 312, 213\}$, one can see that two
of the four outcomes of $A \cup B$ belong to $C^c$ as well, so $P(C^c \mid A \cup B) = 1/2$.


3.3 Using the definition we find:


\[ P(A \mid C) + P(A^c \mid C) = \frac{P(A \cap C)}{P(C)} + \frac{P(A^c \cap C)}{P(C)} = 1, \]


because $C$ can be split into disjoint parts $A \cap C$ and $A^c \cap C$ and therefore


\[ P(A \cap C) + P(A^c \cap C) = P(C). \]


3.4 This asks for the probability that the particle stays more than 3 seconds,
given that it does not stay longer than 4 seconds, so 4 or less. From the
definition:

\[ P(R_3 \mid R_4^c) = \frac{P(R_3 \cap R_4^c)}{P(R_4^c)}. \]

The event $R_3 \cap R_4^c$ describes: longer than 3 but not longer than 4 seconds.
Furthermore, $R_3$ is the disjoint union of the events $R_3 \cap R_4^c$ and $R_3 \cap R_4 = R_4$,
so $P(R_3 \cap R_4^c) = P(R_3) - P(R_4) = e^{-3} - e^{-4}$. Using the complement rule:
$P(R_4^c) = 1 - P(R_4) = 1 - e^{-4}$. Together:

\[ P(R_3 \mid R_4^c) = \frac{e^{-3} - e^{-4}}{1 - e^{-4}} = \frac{0.0315}{0.9817} = 0.0321. \]


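3.5 Instead of a calendar of 365 days, we have one with just 12 months. Let
$C_n$ be the event $n$ arbitrary persons have different months of birth. Then


\[ P(C_3) = \left(1 - \frac{2}{12}\right) \left(1 - \frac{1}{12}\right) = \frac{55}{72} = 0.7639 \]


and it is no surprise that this is much smaller than $P(B_3)$. The general formula
is

\[ P(C_n) = \left(1 - \frac{n-1}{12}\right) \cdots \left(1 - \frac{2}{12}\right) \left(1 - \frac{1}{12}\right). \]


Note that it is correct even if $n$ is 13 or more, in which case $P(C_n) = 0$.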


3.6 Repeating the calculation we find:

\[ P(T \cap B) = 0.99 \cdot 0.02 = 0.0198 \]
\[ P(T \cap B^c) = 0.05 \cdot 0.98 = 0.0490 \]

so $P(T) = P(T \cap B) + P(T \cap B^c) = 0.0198 + 0.0490 = 0.0688$.

3.7 In the solution to Quick exercise 3.6 we already found $P(T \cap B) = 0.0198$
and $P(T) = 0.0688$, so

\[ P(B \mid T) = \frac{P(T \cap B)}{P(T)} = \frac{0.0198}{0.0688} = 0.2878. \]

Further, $P(T^c) = 1 - 0.0688 = 0.9312$ and $P(T^c \mid B) = 1 - P(T \mid B) = 0.01$.
So, $P(B \cap T^c) = 0.01 \cdot 0.02 = 0.0002$ and

\[ P(B \mid T^c) = \frac{0.0002}{0.9312} = 0.00021. \]


3.8 It takes three steps of applying (3.4) and (3.5):


$H$ independent of $R$ $\iff$ $H^c$ independent of $R$ by (3.4)
$H^c$ independent of $R$ $\iff$ $R$ independent of $H^c$ by (3.5)
$R$ independent of $H^c$ $\iff$ $R^c$ independent of $H^c$ by (3.4).


3.6 Exercises 


3.1 Your lecturer wants to walk from A to B (see the map). To do so, he
first randomly selects one of the paths to C, D, or E. Next he selects randomly
one of the possible paths at that moment (so if he first selected the path to
E, he can either select the path to A or the path to F), etc. What is the
probability that he will reach B after two selections?


[Map showing the paths between the points A, B, C, D, E, and F.]


3.2 A fair die is thrown twice. A is the event “sum of the throws equals 4,”
B is “at least one of the throws is a 3.” 


a. Calculate P(A| B). 
b. Are A and B independent events? 


3.3 We draw two cards from a regular deck of 52. Let $S_1$ be the event “the
first one is a spade,” and $S_2$ “the second one is a spade.”


a. Compute $P(S_1)$, $P(S_2 \mid S_1)$, and $P(S_2 \mid S_1^c)$.
b. Compute $P(S_2)$ by conditioning on whether the first card is a spade.


3.4 A Dutch cow is tested for BSE, using Test A as described in Section 3.3,
with $P(T \mid B) = 0.70$ and $P(T \mid B^c) = 0.10$. Assume that the BSE risk for the
Netherlands is the same as in 2003, when it was estimated to be $P(B) =
1.3 \cdot 10^{-5}$. Compute $P(B \mid T)$ and $P(B \mid T^c)$.


3.5 A ball is drawn at random from an urn containing one red and one white 
ball. If the white ball is drawn, it is put back into the urn. If the red ball 
is drawn, it is returned to the urn together with two more red balls. Then a 
second draw is made. What is the probability a red ball was drawn on both 
the first and the second draws? 


3.6 We choose a month of the year, in such a manner that each month has 
the same probability. Find out whether the following events are independent: 


a. the events “outcome is an even numbered month” (i.e., February, April, 
June, etc.) and “outcome is in the first half of the year.” 

b. the events “outcome is an even numbered month” (i.e., February, April, 
June, etc.) and “outcome is a summer month” (i.e., June, July, August). 




3.7 Calculate


a. $P(A \cup B)$ if it is given that $P(A) = 1/3$ and $P(B \mid A^c) = 1/4$.
b. $P(B)$ if it is given that $P(A \cup B) = 2/3$ and $P(A^c \mid B^c) = 1/2$.


3.8 Spaceman Spiff’s spacecraft has a warning light that is supposed to
switch on when the freem blasters are overheated. Let $W$ be the event “the
warning light is switched on” and $F$ “the freem blasters are overheated.”
Suppose the probability of freem blaster overheating $P(F)$ is 0.1, that the
probability that the light is switched on when they actually are overheated
is 0.99, and that there is a 2% chance that it comes on when nothing is
wrong: $P(W \mid F^c) = 0.02$.


a. Determine the probability that the warning light is switched on. 


b. Determine the conditional probability that the freem blasters are over- 
heated, given that the warning light is on. 


3.9 A certain grapefruit variety is grown in two regions in southern Spain.
Both areas get infested from time to time with parasites that damage the
crop. Let $A$ be the event that region $R_1$ is infested with parasites and $B$ that
region $R_2$ is infested. Suppose $P(A) = 3/4$, $P(B) = 2/5$ and $P(A \cup B) = 4/5$.
If the food inspection detects the parasite in a ship carrying grapefruits from
$R_1$, what is the probability region $R_2$ is infested as well?


3.10 A student takes a multiple-choice exam. Suppose for each question he 
either knows the answer or gambles and chooses an option at random. Further 
suppose that if he knows the answer, the probability of a correct answer is 1, 
and if he gambles this probability is 1/4. To pass, students need to answer at 
least 60% of the questions correctly. The student has “studied for a minimal 
pass,” i.e., with probability 0.6 he knows the answer to a question. Given that 
he answers a question correctly, what is the probability that he actually knows 
the answer? 


3.11 A breath analyzer, used by the police to test whether drivers exceed 
the legal limit set for the blood alcohol percentage while driving, is known to 
satisfy 

\[ P(A \mid B) = P(A^c \mid B^c) = p, \]
where A is the event “breath analyzer indicates that legal limit is exceeded” 
and B “driver’s blood alcohol percentage exceeds legal limit.” On Saturday 
night about 5% of the drivers are known to exceed the limit. 


a. Describe in words the meaning of $P(B^c \mid A)$.
b. Determine $P(B^c \mid A)$ if $p = 0.95$.
c. How big should $p$ be so that $P(B \mid A) = 0.9$?


3.12 The events $A$, $B$, and $C$ satisfy: $P(A \mid B \cap C) = 1/4$, $P(B \mid C) = 1/3$,
and $P(C) = 1/2$. Calculate $P(A^c \cap B \cap C)$.




3.13 In Exercise 2.12 we computed the probability of a “dream draw” in the
UEFA playoffs lottery by counting outcomes. Recall that there were ten teams
in the lottery, five considered “strong” and five considered “weak.” Introduce
events $D_i$, “the $i$th pair drawn is a dream combination,” where a “dream
combination” is a pair of a strong team with a weak team, and $i = 1, \ldots, 5$.


a. Compute $P(D_1)$.

b. Compute $P(D_2 \mid D_1)$ and $P(D_1 \cap D_2)$.

c. Compute $P(D_3 \mid D_1 \cap D_2)$ and $P(D_1 \cap D_2 \cap D_3)$.

d. Continue the procedure to obtain the probability of a “dream draw”:
$P(D_1 \cap \cdots \cap D_5)$.


3.14 Recall the Monty Hall problem from Section 1.3. Let R be the event 
“the prize is behind the door you chose initially,” and W the event “you win 
the prize by switching doors.” 


a. Compute $P(W \mid R)$ and $P(W \mid R^c)$.
b. Compute P(W) using the law of total probability. 


3.15 Two independent events $A$ and $B$ are given, and $P(B \mid A \cup B) = 2/3$,
$P(A \mid B) = 1/2$. What is $P(B)$?


3.16 You are diagnosed with an uncommon disease. You know that there
is only a 1% chance of getting it. Use the letter $D$ for the event “you have the
disease” and $T$ for “the test says so.” It is known that the test is imperfect:
$P(T \mid D) = 0.98$ and $P(T^c \mid D^c) = 0.95$.


a. Given that you test positive, what is the probability that you really have 
the disease? 

b. You obtain a second opinion: an independent repetition of the test. You 
test positive again. Given this, what is the probability that you really have 
the disease? 


3.17 You and I play a tennis match. It is deuce, which means if you win the 
next two rallies, you win the game; if I win both rallies, I win the game; if 
we each win one rally, it is deuce again. Suppose the outcome of a rally is 
independent of other rallies, and you win a rally with probability p. Let W be 
the event “you win the game,” G “the game ends after the next two rallies,” 
and D “it becomes deuce again.” 


a. Determine P(W |G). 
b. Show that $P(W) = p^2 + 2p(1-p)\,P(W \mid D)$ and use $P(W) = P(W \mid D)$
(why is this so?) to determine $P(W)$.


c. Explain why the answers are the same. 


3.18 Suppose A and B are events with 0 < P(A) < 1 and 0 < P(B) <1. 


a. If $A$ and $B$ are disjoint, can they be independent?

b. If $A$ and $B$ are independent, can they be disjoint?

c. If $A \subset B$, can $A$ and $B$ be independent?

d. If $A$ and $B$ are independent, can $A$ and $A \cup B$ be independent?


4


Discrete random variables 


The sample space associated with an experiment, together with a probability 
function defined on all its events, is a complete probabilistic description of 
that experiment. Often we are interested only in certain features of this de- 
scription. We focus on these features using random variables. In this chapter 
we discuss discrete random variables, and in the next we will consider contin- 
uous random variables. We introduce the Bernoulli, binomial, and geometric 
random variables. 


4.1 Random variables 


Suppose we are playing the board game “Snakes and Ladders,” where the 
moves are determined by the sum of two independent throws with a die. An 
obvious choice of the sample space is 


\[
\begin{aligned}
\Omega &= \{(\omega_1, \omega_2) : \omega_1, \omega_2 \in \{1, 2, \ldots, 6\}\} \\
&= \{(1,1), (1,2), \ldots, (1,6), (2,1), \ldots, (6,5), (6,6)\}.
\end{aligned}
\]


However, as players of the game, we are only interested in the sum of the 
outcomes of the two throws, i.e., in the value of the function $S : \Omega \to \mathbb{R}$, given
by

\[ S(\omega_1, \omega_2) = \omega_1 + \omega_2 \quad \text{for } (\omega_1, \omega_2) \in \Omega. \]


In Table 4.1 the possible results of the first throw (top margin), those of the 
second throw (left margin), and the corresponding values of S (body) are 
given. Note that the values of S are constant on lines perpendicular to the 
diagonal. We denote the event that the function S attains the value k by 
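$\{S = k\}$, which is an abbreviation of “the subset of those $\omega = (\omega_1, \omega_2) \in \Omega$
for which $S(\omega_1, \omega_2) = \omega_1 + \omega_2 = k$,” i.e.,

\[ \{S = k\} = \{(\omega_1, \omega_2) \in \Omega : S(\omega_1, \omega_2) = k\}. \]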




Table 4.1. Two throws with a die and the corresponding sum. 


                      $\omega_1$
  $\omega_2$    1    2    3    4    5    6
     1     2    3    4    5    6    7
     2     3    4    5    6    7    8
     3     4    5    6    7    8    9
     4     5    6    7    8    9   10
     5     6    7    8    9   10   11
     6     7    8    9   10   11   12


QUICK EXERCISE 4.1 List the outcomes in the event {S = 8}. 


We denote the probability of the event $\{S = k\}$ by

\[ P(S = k), \]

although formally we should write $P(\{S = k\})$ instead of $P(S = k)$. In our
example, $S$ attains only the values $k = 2, 3, \ldots, 12$ with positive probability.
For example,

\[ P(S = 2) = P(\{(1,1)\}) = \frac{1}{36}, \]

while

\[ P(S = 13) = P(\emptyset) = 0, \]


because 13 is an “impossible outcome.” 
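
As a minimal sketch (in Python, which is our choice and not part of the text), the event $\{S = k\}$ and its probability can be found by listing the 36 equally likely outcomes. The names `omega`, `event_sum`, and `prob_sum` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered pairs (w1, w2) of two throws with a die.
omega = list(product(range(1, 7), repeat=2))

def event_sum(k):
    """Return the outcomes (w1, w2) in the event {S = k}."""
    return [(w1, w2) for (w1, w2) in omega if w1 + w2 == k]

def prob_sum(k):
    """P(S = k), assuming all 36 outcomes are equally likely."""
    return Fraction(len(event_sum(k)), len(omega))

print(event_sum(2), prob_sum(2))    # [(1, 1)] 1/36
print(event_sum(13), prob_sum(13))  # [] 0
```

Exact fractions are used so that the results can be compared directly with the values in the text.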


QUICK EXERCISE 4.2 Use Table 4.1 to determine $P(S = k)$ for $k = 4, 5, \ldots, 12$.


Now suppose that for some other game the moves are given by the maximum 
of two independent throws. In this case we are interested in the value of the 
function $M : \Omega \to \mathbb{R}$, given by

\[ M(\omega_1, \omega_2) = \max\{\omega_1, \omega_2\} \quad \text{for } (\omega_1, \omega_2) \in \Omega. \]


In Table 4.2 the possible results of the first throw (top margin), those of the 
second throw (left margin), and the corresponding values of M (body) are 
given. The functions S and M are examples of what we call discrete random 
variables. 


DEFINITION. Let $\Omega$ be a sample space. A discrete random variable
is a function $X : \Omega \to \mathbb{R}$ that takes on a finite number of values
$a_1, a_2, \ldots, a_m$ or an infinite number of values $a_1, a_2, \ldots$.




Table 4.2. Two throws with a die and the corresponding maximum. 


                      $\omega_1$
  $\omega_2$    1    2    3    4    5    6
     1     1    2    3    4    5    6
     2     2    2    3    4    5    6
     3     3    3    3    4    5    6
     4     4    4    4    4    5    6
     5     5    5    5    5    5    6
     6     6    6    6    6    6    6


In a way, a discrete random variable $X$ “transforms” a sample space $\Omega$ to a
more “tangible” sample space $\tilde{\Omega}$, whose events are more directly related to
what you are interested in. For instance, $S$ transforms
$\Omega = \{(1,1), (1,2), \ldots, (1,6), (2,1), \ldots, (6,5), (6,6)\}$ to $\tilde{\Omega} = \{2, \ldots, 12\}$,
and $M$ transforms $\Omega$ to $\tilde{\Omega} = \{1, \ldots, 6\}$. Of course, there is a price to pay: one has to calculate the
probabilities of $X$. Or, to say things more formally, one has to determine
the probability distribution of $X$, i.e., to describe how the probability mass is
distributed over possible values of $X$.


4.2 The probability distribution of a discrete random 
variable 


Once a discrete random variable $X$ is introduced, the sample space $\Omega$ is no
longer important. It suffices to list the possible values of X and their corre- 
sponding probabilities. This information is contained in the probability mass 
function of X. 


DEFINITION. The probability mass function $p$ of a discrete random
variable $X$ is the function $p : \mathbb{R} \to [0,1]$, defined by

\[ p(a) = P(X = a) \quad \text{for } -\infty < a < \infty. \]


If $X$ is a discrete random variable that takes on the values $a_1, a_2, \ldots$, then

\[ p(a_i) > 0, \quad p(a_1) + p(a_2) + \cdots = 1, \quad \text{and } p(a) = 0 \text{ for all other } a. \]


As an example we give the probability mass function p of M. 


  a      1     2     3     4     5     6
  p(a)   1/36  3/36  5/36  7/36  9/36  11/36


Of course, p(a) = 0 for all other a. 
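
The table can be checked by counting outcomes. The following is a small sketch under the assumption that all 36 outcomes are equally likely; the names `counts`, `pmf_M`, and `p` are ours.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how often each value of M = max(w1, w2) occurs among the 36 outcomes.
counts = Counter(max(w1, w2) for w1, w2 in product(range(1, 7), repeat=2))

# Probability mass function of M on its possible values 1, ..., 6.
pmf_M = {a: Fraction(counts[a], 36) for a in range(1, 7)}

def p(a):
    """p(a) = P(M = a); zero for every a that M cannot attain."""
    return pmf_M.get(a, Fraction(0))

for a in range(1, 7):
    print(a, counts[a], "/ 36")
```

The counts 1, 3, 5, 7, 9, 11 reproduce the numerators in the table, and the probabilities add up to 1.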




The distribution function of a random variable 


As we will see, so-called continuous random variables cannot be specified 
by giving a probability mass function. However, the distribution function of 
a random variable X (also known as the cumulative distribution function) 
allows us to treat discrete and continuous random variables in the same way. 


DEFINITION. The distribution function $F$ of a random variable $X$
is the function $F : \mathbb{R} \to [0,1]$, defined by

\[ F(a) = P(X \le a) \quad \text{for } -\infty < a < \infty. \]


Both the probability mass function and the distribution function of a discrete 
random variable X contain all the probabilistic information of X; the probabil- 
ity distribution of X is determined by either of them. In fact, the distribution 
function $F$ of a discrete random variable $X$ can be expressed in terms of the
probability mass function $p$ of $X$ and vice versa. If $X$ attains values $a_1, a_2, \ldots$,
such that

\[ p(a_i) > 0, \quad p(a_1) + p(a_2) + \cdots = 1, \]

then

\[ F(a) = \sum_{a_i \le a} p(a_i). \]

We see that, for a discrete random variable $X$, the distribution function $F$
jumps in each of the $a_i$, and is constant between successive $a_i$. The height of
the jump at $a_i$ is $p(a_i)$; in this way $p$ can be retrieved from $F$. For example,
see Figure 4.1, where $p$ and $F$ are displayed for the random variable $M$.


[Figure 4.1: left panel, $p(a)$ against $a$; right panel, $F(a)$ against $a$; both for $a = 1, \ldots, 6$.]


Fig. 4.1. Probability mass function and distribution function of M. 
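
The passage from $p$ to $F$ and back can be sketched in a few lines. This is only an illustration for the random variable $M$; the helper names `F` and `jump` and the step size `eps` are assumptions of ours.

```python
from fractions import Fraction

# Probability mass function of M, copied from the table in the text.
pmf_M = {a: Fraction(2 * a - 1, 36) for a in range(1, 7)}

def F(a):
    """F(a) = P(M <= a): sum p(a_i) over all values a_i <= a."""
    return sum((p for a_i, p in pmf_M.items() if a_i <= a), Fraction(0))

def jump(a, eps=Fraction(1, 1000)):
    """Height of the jump of F at a; eps must be smaller than the gap between values."""
    return F(a) - F(a - eps)

print(F(0.5), F(3), F(3.9), F(6))  # 0 1/4 1/4 1
print(jump(3) == pmf_M[3])         # True
```

Note that Python prints fractions in lowest terms, so 9/36 appears as 1/4. The function is constant between 3 and 4, and the jump at 3 returns $p(3) = 5/36$.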




We end this section with three properties of the distribution function F of a 
random variable X: 


1. For $a \le b$ one has that $F(a) \le F(b)$. This property is an immediate
consequence of the fact that $a \le b$ implies that the event $\{X \le a\}$ is
contained in the event $\{X \le b\}$.

2. Since $F(a)$ is a probability, the value of the distribution function is always
between 0 and 1. Moreover,

\[ \lim_{a \to +\infty} F(a) = \lim_{a \to +\infty} P(X \le a) = 1, \]
\[ \lim_{a \to -\infty} F(a) = \lim_{a \to -\infty} P(X \le a) = 0. \]

3. $F$ is right-continuous, i.e., one has

\[ \lim_{\varepsilon \downarrow 0} F(a + \varepsilon) = F(a). \]

This is indicated in Figure 4.1 by bullets. Henceforth we will omit these
bullets.


Conversely, any function $F$ satisfying 1, 2, and 3 is the distribution function
of some random variable (see Remarks 6.1 and 6.2). 


QUICK EXERCISE 4.3 Let X be a discrete random variable, and let a be such 
that p(a) > 0. Show that F(a) = P(X < a)+p(a). 


There are many discrete random variables that arise in a natural way. We 
introduce three of them in the next two sections. 


4.3 The Bernoulli and binomial distributions 


The Bernoulli distribution is used to model an experiment with only two pos- 
sible outcomes, often referred to as “success” and “failure”, usually encoded 
as 1 and 0. 


DEFINITION. A discrete random variable $X$ has a Bernoulli distribution
with parameter $p$, where $0 \le p \le 1$, if its probability mass
function is given by

\[ p_X(1) = P(X = 1) = p \quad \text{and} \quad p_X(0) = P(X = 0) = 1 - p. \]

We denote this distribution by Ber($p$).


Note that we wrote $p_X$ instead of $p$ for the probability mass function of $X$. This
was done to emphasize its dependence on $X$ and to avoid possible confusion
with the parameter $p$ of the Bernoulli distribution.




Consider the (fictitious) situation that you attend, completely unprepared, 
a multiple-choice exam. It consists of 10 questions, and each question has 
four alternatives (of which only one is correct). You will pass the exam if 
you answer six or more questions correctly. You decide to answer each of the 
questions in a random way, in such a way that the answer of one question is 
not affected by the answers of the others. What is the probability that you 
will pass? 

Setting for $i = 1, 2, \ldots, 10$

\[ R_i = \begin{cases} 1 & \text{if the $i$th answer is correct} \\ 0 & \text{if the $i$th answer is incorrect,} \end{cases} \]

the number of correct answers $X$ is given by

\[ X = R_1 + R_2 + R_3 + R_4 + R_5 + R_6 + R_7 + R_8 + R_9 + R_{10}. \]


QUICK EXERCISE 4.4 Calculate the probability that you answered the first 
question correctly and the second one incorrectly. 


Clearly, X attains only the values 0,1,...,10. Let us first consider the case 
X = 0. Since the answers to the different questions do not influence each other, 
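we conclude that the events $\{R_1 = a_1\}, \ldots, \{R_{10} = a_{10}\}$ are independent for
every choice of the $a_i$, where each $a_i$ is 0 or 1. We find

\[
\begin{aligned}
P(X = 0) &= P(\text{not a single } R_i \text{ equals } 1) \\
&= P(R_1 = 0, R_2 = 0, \ldots, R_{10} = 0) \\
&= P(R_1 = 0)\, P(R_2 = 0) \cdots P(R_{10} = 0) \\
&= \left(\frac{3}{4}\right)^{10}.
\end{aligned}
\]

The probability that exactly one question is answered correctly equals

\[ P(X = 1) = \frac{1}{4} \cdot \left(\frac{3}{4}\right)^{9} \cdot 10, \]

which is the probability that the answer is correct times the probability that
the other nine answers are wrong, times the number of ways in which this can
occur:

\[
\begin{aligned}
P(X = 1) ={} & P(R_1 = 1)\, P(R_2 = 0)\, P(R_3 = 0) \cdots P(R_{10} = 0) \\
& + P(R_1 = 0)\, P(R_2 = 1)\, P(R_3 = 0) \cdots P(R_{10} = 0) \\
& \;\;\vdots \\
& + P(R_1 = 0)\, P(R_2 = 0)\, P(R_3 = 0) \cdots P(R_{10} = 1).
\end{aligned}
\]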
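
The two probabilities can also be confirmed by brute force: list all $2^{10}$ patterns of correct and incorrect answers and add the probabilities of the relevant ones. This is a sketch only; `p_correct`, `prob_pattern`, and `prob_X` are names we introduce for illustration.

```python
from fractions import Fraction
from itertools import product

p_correct = Fraction(1, 4)  # chance of guessing one question correctly

def prob_pattern(pattern):
    """Probability of one 0/1 answer pattern (a_1, ..., a_10), using independence."""
    prob = Fraction(1)
    for a_i in pattern:
        prob *= p_correct if a_i == 1 else 1 - p_correct
    return prob

def prob_X(k):
    """P(X = k): add the probabilities of all patterns with exactly k ones."""
    return sum(
        (prob_pattern(pat) for pat in product((0, 1), repeat=10) if sum(pat) == k),
        Fraction(0),
    )

print(prob_X(0) == Fraction(3, 4) ** 10)                       # True
print(prob_X(1) == 10 * Fraction(1, 4) * Fraction(3, 4) ** 9)  # True
```

Enumeration is feasible here because there are only 1024 patterns; the counting argument that follows avoids it.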


In general we find for k = 0,1,...,10, again using independence, that 




\[ P(X = k) = \left(\frac{1}{4}\right)^{k} \cdot \left(\frac{3}{4}\right)^{10-k} \cdot C_{10,k}, \]

which is the probability that $k$ questions were answered correctly times the
probability that the other $10 - k$ answers are wrong, times the number of ways
$C_{10,k}$ this can occur.


So $C_{10,k}$ is the number of different ways in which one can choose $k$ correct
answers from the list of 10. We already have seen that $C_{10,0} = 1$, because
there is only one way to do everything wrong; and that $C_{10,1} = 10$, because
each of the 10 questions may have been answered correctly.

More generally, if we have to choose k different objects out of an ordered list 
of n objects, and the order in which we pick the objects matters, then for 
the first object you have n possibilities, and no matter which object you pick, 
for the second one there are n — 1 possibilities. For the third there are n — 2 
possibilities, and so on, with n — (k — 1) possibilities for the kth. So there are 


\[ n(n-1) \cdots (n-(k-1)) \]


ways to choose the k objects. 

In how many ways can we choose three questions? When the order matters,
there are $10 \cdot 9 \cdot 8$ ways. However, the order in which these three questions are
selected does not matter: to answer questions 2, 5, and 8 correctly is the same
as answering questions 8, 2, and 5 correctly, and so on. The triplet $\{2, 5, 8\}$
can be chosen in $3 \cdot 2 \cdot 1$ different orders, all with the same result. There are
six permutations of the numbers 2, 5, and 8 (see page 14).

Thus, compensating for this six-fold overcount, the number $C_{10,3}$ of ways to
correctly answer 3 questions out of 10 becomes

\[ C_{10,3} = \frac{10 \cdot 9 \cdot 8}{3 \cdot 2 \cdot 1}. \]


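More generally, for $n \ge 1$ and $1 \le k \le n$,

\[ C_{n,k} = \frac{n(n-1) \cdots (n-(k-1))}{k(k-1) \cdots 2 \cdot 1}. \]

Note that this is equal to

\[ \frac{n!}{k!\,(n-k)!}, \]

which is usually denoted by $\binom{n}{k}$, so $C_{n,k} = \binom{n}{k}$. Moreover, in accordance with
$0! = 1$ (as defined in Chapter 2), we put $C_{n,0} = \binom{n}{0} = 1$.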
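
The counting argument translates directly into a short computation: first count ordered selections, then divide out the $k!$ orders. The function names `count_ordered` and `C` are ours; `math.comb` is Python's built-in binomial coefficient (available from Python 3.8), used here only as a cross-check.

```python
import math

def count_ordered(n, k):
    """Number of ways to pick k objects from n when order matters: n(n-1)...(n-(k-1))."""
    result = 1
    for j in range(k):
        result *= n - j
    return result

def C(n, k):
    """C_{n,k}: divide out the k! orders in which the same k objects can be picked."""
    return count_ordered(n, k) // math.factorial(k)

print(count_ordered(10, 3), C(10, 3))  # 720 120
print(C(10, 0), C(10, 1))              # 1 10
print(all(C(10, k) == math.comb(10, k) for k in range(11)))  # True
```

For $k = 0$ the product is empty and equals 1, which matches the convention $C_{n,0} = 1$.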

QUICK EXERCISE 4.5 Show that $\binom{n}{n-k} = \binom{n}{k}$.


Substituting $\binom{10}{k}$ for $C_{10,k}$ we obtain

\[ P(X = k) = \binom{10}{k} \left(\frac{1}{4}\right)^{k} \left(\frac{3}{4}\right)^{10-k}. \]


Since $P(X \ge 6) = P(X = 6) + \cdots + P(X = 10)$, it is now an easy (but tedious)
exercise to determine the probability that you will pass. One finds that
$P(X \ge 6) = 0.0197$. It pays to study, doesn’t it?!
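
The tedious part is easily delegated to a computer. The sketch below adds the five terms exactly; the names `pmf_X` and `pass_prob` are illustrative.

```python
from fractions import Fraction
from math import comb

def pmf_X(k, n=10, p=Fraction(1, 4)):
    """P(X = k) = C(n, k) p^k (1 - p)^(n - k) for the number of correct answers."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of passing: six or more correct answers out of ten.
pass_prob = sum(pmf_X(k) for k in range(6, 11))

print(pass_prob)                   # 10343/524288
print(round(float(pass_prob), 4))  # 0.0197
```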

The preceding random variable X is an example of a random variable with a 
binomial distribution with parameters $n = 10$ and $p = 1/4$.


DEFINITION. A discrete random variable $X$ has a binomial distribution
with parameters $n$ and $p$, where $n = 1, 2, \ldots$ and $0 \le p \le 1$,
if its probability mass function is given by

\[ p_X(k) = P(X = k) = \binom{n}{k} p^{k} (1-p)^{n-k} \quad \text{for } k = 0, 1, \ldots, n. \]


We denote this distribution by Bin(n, p). 


Figure 4.2 shows the probability mass function $p_X$ and distribution function
$F_X$ of a Bin(10, 1/4) distributed random variable.


[Figure 4.2: left panel, $p_X(k)$ against $k$; right panel, $F_X(a)$ against $a$; both for values 0 to 10.]


Fig. 4.2. Probability mass function and distribution function of the Bin(10, 1/4)
distribution.


4.4 The geometric distribution 


In 1986, Weinberg and Gladen [38] investigated the number of menstrual cy- 
cles it took women to become pregnant, measured from the moment they had 




decided to become pregnant. We model the number of cycles up to pregnancy 
by a random variable X. 


Assume that the probability that a woman becomes pregnant during a partic- 
ular cycle is equal to p, for some p with 0 < p < 1, independent of the previous 
cycles. Then clearly P(X = 1) = p. Due to the independence of consecutive 
cycles, one finds for k = 1,2,... that 


\[
\begin{aligned}
P(X = k) &= P(\text{no pregnancy in the first } k-1 \text{ cycles, pregnancy in the } k\text{th}) \\
&= (1-p)^{k-1} p.
\end{aligned}
\]


This random variable X is an example of a random variable with a geometric 
distribution with parameter p. 


DEFINITION. A discrete random variable $X$ has a geometric distribution
with parameter $p$, where $0 < p \le 1$, if its probability mass
function is given by

\[ p_X(k) = P(X = k) = (1-p)^{k-1} p \quad \text{for } k = 1, 2, \ldots. \]


We denote this distribution by Geo(p). 


Figure 4.3 shows the probability mass function $p_X$ and distribution function
$F_X$ of a Geo(1/4) distributed random variable.


[Figure 4.3: left panel, $p_X(k)$ against $k$; right panel, $F_X(a)$ against $a$; both for values 1 to 20.]


Fig. 4.3. Probability mass function and distribution function of the Geo(1/4)
distribution.


QUICK EXERCISE 4.6 Let $X$ have a Geo($p$) distribution. For $n \ge 0$, show that
$P(X > n) = (1-p)^{n}$.




The geometric distribution has a remarkable property, which is known as the
memoryless property.¹ For $n, k = 0, 1, 2, \ldots$ one has

\[ P(X > n + k \mid X > k) = P(X > n). \]

We can derive this equality using the result from Quick exercise 4.6:

\[
\begin{aligned}
P(X > n + k \mid X > k) &= \frac{P(\{X > k + n\} \cap \{X > k\})}{P(X > k)} \\
&= \frac{P(X > k + n)}{P(X > k)} = \frac{(1-p)^{n+k}}{(1-p)^{k}} \\
&= (1-p)^{n} = P(X > n).
\end{aligned}
\]
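
A quick exact check of the memoryless property for one parameter value may help. This is a sketch, not a proof: the value $p = 1/4$ and the names `tail` and `conditional_tail` are assumptions made for illustration.

```python
from fractions import Fraction

def tail(n, p):
    """P(X > n) = (1 - p)^n for a Geo(p) variable (Quick exercise 4.6)."""
    return (1 - p) ** n

def conditional_tail(n, k, p):
    """P(X > n + k | X > k) = P(X > n + k) / P(X > k)."""
    return tail(n + k, p) / tail(k, p)

p = Fraction(1, 4)  # illustrative parameter value
checks = [conditional_tail(n, k, p) == tail(n, p) for n in range(6) for k in range(6)]
print(all(checks))  # True
```

Having already waited $k$ cycles does not change the distribution of the remaining waiting time.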


4.5 Solutions to the quick exercises 


4.1 From Table 4.1, one finds that 
\[ \{S = 8\} = \{(2,6), (3,5), (4,4), (5,3), (6,2)\}. \]

4.2 From Table 4.1, one determines the following table.

  k          4     5     6     7     8     9     10    11    12
  P(S = k)   3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

4.3 Since $\{X \le a\} = \{X < a\} \cup \{X = a\}$, it follows that

\[ F(a) = P(X \le a) = P(X < a) + P(X = a) = P(X < a) + p(a). \]


Not very interestingly: this also holds if p(a) = 0. 


4.4 The probability that you answered the first question correctly and the
second one incorrectly is given by $P(R_1 = 1, R_2 = 0)$. Due to independence,
this is equal to $P(R_1 = 1)\, P(R_2 = 0) = \frac{1}{4} \cdot \frac{3}{4} = \frac{3}{16}$.


4.5 Rewriting yields

\[ \binom{n}{n-k} = \frac{n!}{(n-k)!\,\bigl(n-(n-k)\bigr)!} = \frac{n!}{k!\,(n-k)!} = \binom{n}{k}. \]


¹ In fact, the geometric distribution is the only discrete random variable with this
property.




4.6 There are two ways to show that $P(X > n) = (1-p)^{n}$. The easiest way is
to realize that $P(X > n)$ is the probability that we had “no success in the first
$n$ trials,” which clearly equals $(1-p)^{n}$. A more involved way is by calculation:

\[
\begin{aligned}
P(X > n) &= P(X = n+1) + P(X = n+2) + \cdots \\
&= (1-p)^{n} p + (1-p)^{n+1} p + \cdots \\
&= (1-p)^{n} p \left(1 + (1-p) + (1-p)^{2} + \cdots\right).
\end{aligned}
\]

If we recall from calculus that

\[ \sum_{k=0}^{\infty} (1-p)^{k} = \frac{1}{1-(1-p)} = \frac{1}{p}, \]

the answer follows immediately.
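
The calculation can be illustrated numerically by summing finitely many terms of the series. In this sketch the values $p = 1/4$ and $n = 3$ and the names `geo_pmf` and `partial_tail` are chosen by us for illustration.

```python
from fractions import Fraction

def geo_pmf(k, p):
    """P(X = k) = (1 - p)^(k - 1) p for k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p

def partial_tail(n, N, p):
    """P(n < X <= N): the first N - n terms of the series for P(X > n)."""
    return sum((geo_pmf(k, p) for k in range(n + 1, N + 1)), Fraction(0))

p, n = Fraction(1, 4), 3
for N in (5, 10, 50):
    # The amount still missing from (1 - p)^n is exactly (1 - p)^N.
    gap = (1 - p) ** n - partial_tail(n, N, p)
    print(N, gap == (1 - p) ** N)  # True for each N
```

Since $(1-p)^{N}$ tends to 0 as $N$ grows, the partial sums approach $(1-p)^{n}$.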


4.6 Exercises 


4.1 Let $Z$ represent the number of times a 6 appeared in two independent
throws of a die, and let $S$ and $M$ be as in Section 4.1.


a. Describe the probability distribution of $Z$, by giving either the probability
mass function $p_Z$ of $Z$ or the distribution function $F_Z$ of $Z$. What type of
distribution does $Z$ have, and what are the values of its parameters?

b. List the outcomes in the events $\{M = 2, Z = 0\}$, $\{S = 5, Z = 1\}$, and
$\{S = 8, Z = 1\}$. What are their probabilities?

c. Determine whether the events {M = 2} and {Z = 0} are independent. 


4.2 Let X be a discrete random variable with probability mass function p 
given by: 


[Table of the values $a$ and the probabilities $p(a)$; the entries are illegible in this copy.]

and p(a) = 0 for all other a. 


a. Let the random variable $Y$ be defined by $Y = X^2$, i.e., if $X = 2$, then
$Y = 4$. Calculate the probability mass function of $Y$.

b. Calculate the value of the distribution functions of $X$ and $Y$ in $a = 1$,
$a = 3/4$, and $a = \pi - 3$.


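4.3 Suppose that the distribution function of a discrete random variable $X$
is given by

[Piecewise definition of $F(a)$: $F(a) = 0$ for $a < 0$ and $F(a) = 1$ beyond the largest breakpoint, with two intermediate steps; the breakpoints and intermediate values are illegible in this copy.]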

Determine the probability mass function of X. 


4.4 You toss n coins, each showing heads with probability p, independently 
of the other tosses. Each coin that shows tails is tossed again. Let X be the 
total number of heads. 


a. What type of distribution does X have? Specify its parameter(s). 
b. What is the probability mass function of the total number of heads X? 

4.5 A fair die is thrown until the sum of the results of the throws exceeds 6.
The random variable $X$ is the number of throws needed for this. Let $F$ be the
distribution function of $X$. Determine $F(1)$, $F(2)$, and $F(7)$.


4.6 Three times we randomly draw a number from the following numbers:

1   2   3.

If $X_i$ represents the $i$th draw, $i = 1, 2, 3$, then the probability mass function
of $X_i$ is given by

  a            1    2    3
  P(X_i = a)   1/3  1/3  1/3

and $P(X_i = a) = 0$ for all other $a$. We assume that each draw is independent
of the previous draws. Let $\bar{X}$ be the average of $X_1$, $X_2$, and $X_3$, i.e.,

\[ \bar{X} = \frac{X_1 + X_2 + X_3}{3}. \]

a. Determine the probability mass function $p_{\bar{X}}$ of $\bar{X}$.
b. Compute the probability that exactly two draws are equal to 1.


4.7 A shop receives a batch of 1000 cheap lamps. The odds that a lamp is
defective are 0.1%. Let X be the number of defective lamps in the batch. 


a. What kind of distribution does X have? What is/are the value(s) of pa- 
rameter(s) of this distribution? 

b. What is the probability that the batch contains no defective lamps? One 
defective lamp? More than two defective ones? 


4.8 In Section 1.4 we saw that each space shuttle has six O-rings and that
each O-ring fails with probability

\[ p(t) = \frac{e^{a + b \cdot t}}{1 + e^{a + b \cdot t}}, \]

where $a = 5.085$, $b = -0.1156$, and $t$ is the temperature (in degrees Fahrenheit)
at the time of the launch of the space shuttle. At the time of the fatal
launch of the Challenger, $t = 31$, yielding $p(31) = 0.8178$.


a. Let X be the number of failing O-rings at launch temperature 31°F. What 
type of probability distribution does X have, and what are the values of 
its parameters? 


b. What is the probability $P(X \ge 1)$ that at least one O-ring fails?


4.9 For simplicity’s sake, let us assume that all space shuttles will be launched 
at 81°F (which is the highest recorded launch temperature in Figure 1.3). With 
this temperature, the probability of an O-ring failure is equal to p(81) = 0.0137 
(see Section 1.4 or Exercise 4.8). 


a. What is the probability that during 23 launches no O-ring will fail, but 
that at least one O-ring will fail during the 24th launch of a space shuttle? 


b. What is the probability that no O-ring fails during 24 launches? 


4.10 Early in the morning, a group of $m$ people decides to use the elevator
in an otherwise deserted building of 21 floors. Each of these persons chooses
his or her floor independently of the others, and (from our point of view)
completely at random, so that each person selects a floor with probability
1/21. Let $S_m$ be the number of times the elevator stops. In order to study
$S_m$, we introduce for $i = 1, 2, \ldots, 21$ random variables $R_i$, given by

\[ R_i = \begin{cases} 1 & \text{if the elevator stops at the $i$th floor} \\ 0 & \text{if the elevator does not stop at the $i$th floor.} \end{cases} \]


a. Each $R_i$ has a Ber($p$) distribution. Show that $p = 1 - \left(\frac{20}{21}\right)^{m}$.

b. From the way we defined $S_m$, it follows that

\[ S_m = R_1 + R_2 + \cdots + R_{21}. \]

Can we conclude that $S_m$ has a Bin(21, $p$) distribution, with $p$ as in part a?
Why or why not?

c. Clearly, if $m = 1$, one has that $P(S_1 = 1) = 1$. Show that for $m = 2$

\[ P(S_2 = 1) = \frac{1}{21} = 1 - P(S_2 = 2), \]

and that $S_3$ has the following distribution.

  a            1      2       3
  P(S_3 = a)   1/441  60/441  380/441




4.11 You decide to play monthly in two different lotteries, and you stop play- 
ing as soon as you win a prize in one (or both) lotteries of at least one million 
euros. Suppose that every time you participate in these lotteries, the probability
to win one million (or more) euros is $p_1$ for one of the lotteries and $p_2$
for the other. Let $M$ be the number of times you participate in these lotteries
until winning at least one prize. What kind of distribution does M have, and 
what is its parameter? 


4.12 You and a friend want to go to a concert, but unfortunately only one
ticket is still available. The man who sells the tickets decides to toss a coin
until heads appears. In each toss heads appears with probability $p$, where
$0 < p < 1$, independent of each of the previous tosses. If the number of tosses
needed is odd, your friend is allowed to buy the ticket; otherwise you can buy 
it. Would you agree to this arrangement? 


4.13 A box contains an unknown number $N$ of identical bolts. In order
to get an idea of the size N, we randomly mark one of the bolts from the 
box. Next we select at random a bolt from the box. If this is the marked bolt 
we stop, otherwise we return the bolt to the box, and we randomly select a 
second one, etc. We stop when the selected bolt is the marked one. Let X be 
the number of times a bolt was selected. Later (in Exercise 21.11) we will try 
to find an estimate of N. Here we look at the probability distribution of X. 


a. What is the probability distribution of X? Specify its parameter(s)! 

b. The drawback of this approach is that X can attain any of the values 
1,2,3,..., so that if N is large we might be sampling from the box for 
quite a long time. We decide to sample from the box in a slightly different 
way: after we have randomly marked one of the bolts in the box, we 
select at random a bolt from the box. If this is the marked one, we stop, 
otherwise we randomly select a second bolt (we do not return the selected 
bolt). We stop when we select the marked bolt. Let Y be the number of 
times a bolt was selected. 

Show that P(Y =k) =1/N for k =1,2,...,N (Y has a so-called discrete 
uniform distribution). 

c. Instead of randomly marking one bolt in the box, we mark m bolts, with 
m smaller than N. Next, we randomly select r bolts; Z is the number of 
marked bolts in the sample. 

Show that

\[ P(Z = k) = \frac{\binom{m}{k} \binom{N-m}{r-k}}{\binom{N}{r}} \quad \text{for } k = 0, 1, 2, \ldots, r. \]

($Z$ has a so-called hypergeometric distribution, with parameters $m$, $N$,
and $r$.)


4.14 We throw a coin until a head turns up for the second time, where $p$ is the
probability that a throw results in a head and we assume that the outcome
of each throw is independent of the previous outcomes. Let $X$ be the number
of times we have thrown the coin.


a. Determine $P(X = 2)$, $P(X = 3)$, and $P(X = 4)$.
b. Show that $P(X = n) = (n-1)\, p^{2} (1-p)^{n-2}$ for $n \ge 2$.


5 


Continuous random variables 


Many experiments have outcomes that take values on a continuous scale. For 
example, in Chapter 2 we encountered the load at which a model of a bridge 
collapses. These experiments have continuous random variables naturally as- 
sociated with them. 


5.1 Probability density functions 


One way to look at continuous random variables is that they arise by a (never- 
ending) process of refinement from discrete random variables. Suppose, for 
example, that a discrete random variable associated with some experiment 
takes on the value 6.283 with probability p. If we refine, in the sense that we 
also get to know the fourth decimal, then the probability p is spread over the 
outcomes 6.2830, 6.2831, ...,6.2839. Usually this will mean that each of these 
new values is taken on with a probability that is much smaller than p—the 
sum of the ten probabilities is p. Continuing the refinement process to more 
and more decimals, the probabilities of the possible values of the outcomes 
become smaller and smaller, approaching zero. However, the probability that 
the possible values lie in some fixed interval [a,b] will settle down. This is 
closely related to the way sums converge to an integral in the definition of the 
integral and motivates the following definition. 


DEFINITION. A random variable $X$ is continuous if for some function
$f : \mathbb{R} \to \mathbb{R}$ and for any numbers $a$ and $b$ with $a \le b$,

\[ P(a \le X \le b) = \int_{a}^{b} f(x)\, dx. \]

The function $f$ has to satisfy $f(x) \ge 0$ for all $x$ and $\int_{-\infty}^{\infty} f(x)\, dx = 1$.
We call $f$ the probability density function (or probability density)
of $X$.




Fig. 5.1. Area under a probability density function $f$ on the interval $[a, b]$.


Note that the probability that $X$ lies in an interval $[a, b]$ is equal to the area
under the probability density function $f$ of $X$ over the interval $[a, b]$; this
is illustrated in Figure 5.1. So if the interval gets smaller and smaller, the
probability will go to zero: for any positive $\varepsilon$

\[ P(a - \varepsilon \le X \le a + \varepsilon) = \int_{a-\varepsilon}^{a+\varepsilon} f(x)\, dx, \]

and sending $\varepsilon$ to 0, it follows that for any $a$

\[ P(X = a) = 0. \]

This implies that for continuous random variables you may be careless about
the precise form of the intervals:

\[ P(a < X < b) = P(a < X \le b) = P(a \le X < b) = P(a \le X \le b). \]


What does $f(a)$ represent? Note (see also Figure 5.2) that

\[ P(a - \varepsilon \le X \le a + \varepsilon) = \int_{a-\varepsilon}^{a+\varepsilon} f(x)\, dx \approx 2\varepsilon f(a) \tag{5.1} \]

for small positive $\varepsilon$. Hence $f(a)$ can be interpreted as a (relative) measure of
how likely it is that $X$ will be near $a$. However, do not think of $f(a)$ as a
probability: $f(a)$ can be arbitrarily large. An example of such an $f$ is given in
the following exercise.


QUICK EXERCISE 5.1 Let the function $f$ be defined by $f(x) = 0$ if $x \le 0$
or $x \ge 1$, and $f(x) = 1/(2\sqrt{x})$ for $0 < x < 1$. You can check quickly that
$f$ satisfies the two properties of a probability density function. Let $X$ be
a random variable with $f$ as its probability density function. Compute the
probability that $X$ lies between $10^{-4}$ and $10^{-2}$.




Fig. 5.2. Approximating the probability that $X$ lies $\varepsilon$-close to $a$.


You should realize that discrete random variables do not have a probability
density function $f$ and continuous random variables do not have a probability
mass function $p$, but that both have a distribution function $F(a) = P(X \le a)$.
Using the fact that for $a < b$ the event $\{X \le b\}$ is a disjoint union of the
events $\{X \le a\}$ and $\{a < X \le b\}$, we can express the probability that $X$ lies
in an interval $(a, b]$ directly in terms of $F$ for both cases:

\[ P(a < X \le b) = P(X \le b) - P(X \le a) = F(b) - F(a). \]
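
For a discrete random variable this identity is easy to check by hand or by machine. The sketch below uses the maximum $M$ of two throws from Chapter 4; the interval $(2, 4]$ and the names `direct` and `via_F` are chosen only for illustration.

```python
from fractions import Fraction

# Probability mass function of M (maximum of two throws), as given in Section 4.2.
pmf_M = {a: Fraction(2 * a - 1, 36) for a in range(1, 7)}

def F(a):
    """Distribution function F(a) = P(M <= a)."""
    return sum((p for a_i, p in pmf_M.items() if a_i <= a), Fraction(0))

# P(2 < M <= 4) computed directly and via the distribution function.
direct = pmf_M[3] + pmf_M[4]
via_F = F(4) - F(2)
print(direct, via_F)  # 1/3 1/3
```

Note that the left endpoint is excluded: for a discrete random variable it matters whether an endpoint belongs to the interval.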


There is a simple relation between the distribution function $F$ and the probability
density function $f$ of a continuous random variable. It follows from
integral calculus that

\[ F(b) = \int_{-\infty}^{b} f(x)\, dx \quad \text{and¹} \quad f(x) = \frac{d}{dx} F(x). \]

Both the probability density function and the distribution function of a continuous
random variable $X$ contain all the probabilistic information about $X$;
the probability distribution of $X$ is described by either of them.


We illustrate all this with an example. Suppose we want to make a probability
model for an experiment that can be described as “an object hits a disc of
radius r in a completely arbitrary way” (of course, this is not you playing
darts; nevertheless we will refer to this example as the darts example). We
are interested in the distance X between the hitting point and the center of
the disc. Since distances cannot be negative, we have F(b) = P(X ≤ b) = 0
when b < 0. Since the object hits the disc, we have F(b) = 1 when b > r. We
interpret “hitting the disc in a completely arbitrary way” to mean that the
probability of hitting any region is proportional to the area of that region. In
particular, because the disc has area πr² and the disc with radius b has area
πb², we should put


¹ This holds for all x where f is continuous.


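$$ F(b) = P(X \le b) = \frac{\pi b^2}{\pi r^2} = \frac{b^2}{r^2} \quad \text{for } 0 \le b \le r. $$

Then the probability density function f of X is equal to 0 outside the interval
[0, r] and

$$ f(x) = \frac{d}{dx} F(x) = \frac{1}{r^2}\,\frac{d}{dx}\,x^2 = \frac{2x}{r^2} \quad \text{for } 0 \le x \le r. $$
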
QUICK EXERCISE 5.2 Compute for the darts example the probability that
0 < X ≤ r/2, and the probability that r/2 < X ≤ r.
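
The darts model can also be checked numerically. The sketch below is an illustration only, not part of the original text: the function names, the rejection-sampling approach, the sample size, and the seed are all assumptions made for the example. It draws points uniformly on a disc and compares the fraction of hits within distance b of the center with b²/r².

```python
import math
import random


def sample_distance(r, rng):
    """Distance to the center of a point chosen uniformly at random on a disc of radius r."""
    # Rejection sampling (an illustrative choice): draw points uniformly in the
    # bounding square and keep the first one that lands inside the disc.
    while True:
        x = rng.uniform(-r, r)
        y = rng.uniform(-r, r)
        d = math.hypot(x, y)
        if d <= r:
            return d


def darts_cdf(b, r):
    """F(b) = b^2 / r^2 on [0, r], 0 to the left of 0 and 1 to the right of r."""
    if b < 0:
        return 0.0
    if b > r:
        return 1.0
    return (b * b) / (r * r)


def empirical_cdf(b, r, n=20_000, seed=1):
    """Fraction of n simulated hits whose distance to the center is at most b."""
    rng = random.Random(seed)
    return sum(sample_distance(r, rng) <= b for _ in range(n)) / n
```

For instance, `empirical_cdf(0.6, 2.0)` and `darts_cdf(0.6, 2.0)` should agree up to simulation noise; the latter equals 0.36/4 = 0.09.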


5.2 The uniform distribution 


In this section we encounter a continuous random variable that describes an 
experiment where the outcome is completely arbitrary, except that we know 
that it lies between certain bounds. Many experiments of physical origin have 
this kind of behavior. For instance, suppose we measure for a long time the 
emission of radioactive particles of some material. Suppose that the experi- 
ment consists of recording in each hour at what times the particles are emitted. 
Then the outcomes will lie in the interval [0,60] minutes. If the measurements 
would concentrate in any way, there is either something wrong with your 
Geiger counter or you are about to discover some new physical law. Not con- 
centrating in any way means that subintervals of the same length should have 
the same probability. It is then clear (cf. equation (5.1)) that the probability 
density function associated with this experiment should be constant on [0,60]. 
This motivates the following definition. 


DEFINITION. A continuous random variable has a uniform distribution
on the interval [α, β] if its probability density function f is given
by f(x) = 0 if x is not in [α, β] and

$$ f(x) = \frac{1}{\beta - \alpha} \quad \text{for } \alpha \le x \le \beta. $$

We denote this distribution by U(α, β).


QUICK EXERCISE 5.3 Argue that the distribution function F of a random
variable that has a U(α, β) distribution is given by F(x) = 0 if x < α,
F(x) = 1 if x > β, and F(x) = (x − α)/(β − α) for α ≤ x ≤ β.


In Figure 5.3 the probability density function and the distribution function of
a U(0, 1/3) distribution are depicted.


Fig. 5.3. The probability density function and the distribution function of the
U(0, 1/3) distribution.


5.3 The exponential distribution 


We already encountered the exponential distribution in the chemical reactor
example of Chapter 3. We will give an argument why it appears in that
example. Let v be the effluent volumetric flow rate, i.e., the volume that leaves
the reactor over a time interval [0, t] is vt (and an equal volume enters the
vessel at the other end). Let V be the volume of the reactor vessel. Then in
total a fraction (v/V) · t will have left the vessel during [0, t], when t is not
too large. Let the random variable T be the residence time of a particle in
the vessel. To compute the distribution of T, we divide the interval [0, t] into
n small intervals of equal length t/n. Assuming perfect mixing, so that the
particle’s position is uniformly distributed over the volume, the particle has
probability p = (v/V) · t/n to have left the vessel during any of the n intervals
of length t/n. If we assume that the behavior of the particle in different time
intervals of length t/n is independent, we have, if we call “leaving the vessel”
a success, that T has a geometric distribution with success probability p. It
follows (see also Quick exercise 4.6) that the probability P(T > t) that the
particle is still in the vessel at time t is, for large n, well approximated by


$$ (1 - p)^n = \left(1 - \frac{vt}{Vn}\right)^n. $$


But then, letting n → ∞, we obtain (recall a well-known limit from your
calculus course)


$$ P(T > t) = \lim_{n \to \infty} \left(1 - \frac{vt}{V} \cdot \frac{1}{n}\right)^n = e^{-\frac{v}{V}t}. $$


It follows that the distribution function of T equals $1 - e^{-\frac{v}{V}t}$, and
differentiating we obtain that the probability density function $f_T$ of T is equal to

$$ f_T(t) = \frac{d}{dt}\left(1 - e^{-\frac{v}{V}t}\right) = \frac{v}{V}\,e^{-\frac{v}{V}t} \quad \text{for } t \ge 0. $$

This is an example of an exponential distribution, with parameter v/V.
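
The limit used in this argument can be observed numerically. The following is a small sketch under assumed values (the function names and the choice of rate 0.25 and time 4 are illustrative, not taken from the text): it evaluates (1 − p)^n with p = (v/V) · t/n for growing n and compares the result with the exponential limit.

```python
import math


def survival_discretized(t, rate, n):
    """(1 - p)^n with p = rate * t / n: no 'success' in any of n small intervals."""
    p = rate * t / n
    return (1.0 - p) ** n


def survival_exponential(t, rate):
    """The limiting value e^(-rate * t)."""
    return math.exp(-rate * t)


if __name__ == "__main__":
    rate, t = 0.25, 4.0  # assumed example values; rate plays the role of v/V
    for n in (10, 100, 1000, 10000):
        print(n, survival_discretized(t, rate, n), survival_exponential(t, rate))
```

The printed values in the middle column should move toward the value in the last column as n grows.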


DEFINITION. A continuous random variable has an exponential
distribution with parameter λ if its probability density function f is
given by f(x) = 0 if x < 0 and

$$ f(x) = \lambda e^{-\lambda x} \quad \text{for } x \ge 0. $$

We denote this distribution by Exp(λ).

The distribution function F of an Exp(λ) distribution is given by

$$ F(a) = 1 - e^{-\lambda a} \quad \text{for } a \ge 0. $$


In Figure 5.4 we show the probability density function and the distribution 
function of the Exp(0.25) distribution. 


Fig. 5.4. The probability density and the distribution function of the Exp(0.25)
distribution.


Since we obtained the exponential distribution directly from the geometric
distribution it should not come as a surprise that the exponential distribution
also satisfies the memoryless property, i.e., if X has an exponential
distribution, then for all s, t > 0,

$$ P(X > s + t \mid X > s) = P(X > t). $$

Actually, this follows directly from

$$ P(X > s + t \mid X > s) = \frac{P(X > s + t)}{P(X > s)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t). $$
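
The memoryless property can be checked with a few lines of arithmetic. This is a minimal sketch; the function names and the example parameter values are assumptions chosen for illustration.

```python
import math


def exp_tail(x, lam):
    """P(X > x) = e^(-lam * x) for X with an Exp(lam) distribution and x >= 0."""
    return math.exp(-lam * x)


def conditional_tail(s, t, lam):
    """P(X > s + t | X > s), computed as P(X > s + t) / P(X > s)."""
    return exp_tail(s + t, lam) / exp_tail(s, lam)
```

For any s, t ≥ 0 and λ > 0, `conditional_tail(s, t, lam)` should coincide with `exp_tail(t, lam)` up to floating-point rounding.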




QUICK EXERCISE 5.4 A study of the response time of a certain computer
system finds that the response time, in seconds, is exponentially distributed
with parameter 0.25. What is the probability that the response time
exceeds 5 seconds?


5.4 The Pareto distribution 


More than a century ago the economist Vilfredo Pareto ([20]) noticed that
the number of people whose income exceeded level x was well approximated
by $C/x^{\alpha}$, for some constants C and α > 0 (it appears that for all countries
α is around 1.5). A similar phenomenon occurs with city sizes, earthquake
rupture areas, insurance claims, and sizes of commercial companies. When
these quantities are modeled as realizations of random variables X, then their
distribution functions are of the type $F(x) = 1 - 1/x^{\alpha}$ for x ≥ 1. (Here
1 is a more or less arbitrarily chosen starting point; what matters is the
behavior for large x.) Differentiating, we obtain probability densities of the
form $f(x) = \alpha/x^{\alpha+1}$. This motivates the following definition.
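
The differentiation step can be written out as a short worked equation; nothing beyond the power rule is assumed here.

```latex
f(x) = \frac{d}{dx}\left(1 - x^{-\alpha}\right)
     = \alpha\,x^{-\alpha-1}
     = \frac{\alpha}{x^{\alpha+1}}
     \qquad \text{for } x > 1.
```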


DEFINITION. A continuous random variable has a Pareto distribution
with parameter α > 0 if its probability density function f is given
by f(x) = 0 if x < 1 and

$$ f(x) = \frac{\alpha}{x^{\alpha+1}} \quad \text{for } x \ge 1. $$

We denote this distribution by Par(α).


Fig. 5.5. The probability density and the distribution function of the Par(0.5)
distribution.


In Figure 5.5 we depicted the probability density f and the distribution
function F of the Par(0.5) distribution.


5.5 The normal distribution 


The normal distribution plays a central role in probability theory and statis- 
tics. One of its first applications was due to C.F. Gauss, who used it in 1809 
to model observational errors in astronomy; see [13]. We will see in Chap- 
ter 14 that the normal distribution is an important tool to approximate the 
probability distribution of the average of independent random variables. 


DEFINITION. A continuous random variable has a normal distribution
with parameters μ and σ² > 0 if its probability density function
f is given by

$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \quad \text{for } -\infty < x < \infty. $$

We denote this distribution by N(μ, σ²).


In Figure 5.6 the graphs of the probability density function f and distribution
function F of the normal distribution with μ = 3 and σ² = 6.25 are displayed.


Fig. 5.6. The probability density and the distribution function of the N(3, 6.25)
distribution.


If X has an N(μ, σ²) distribution, then its distribution function is given by

$$ F(a) = \int_{-\infty}^{a} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\,dx \quad \text{for } -\infty < a < \infty. $$


Unfortunately there is no explicit expression for F; f has no antiderivative
that can be written in terms of elementary functions.
However, as we shall see in Chapter 8, any N(μ, σ²) distributed random
variable can be turned into an N(0, 1) distributed random variable by a simple
transformation. As a consequence, a table of the N(0, 1) distribution suffices.
The latter is called the standard normal distribution, and because of its special
role the letter φ has been reserved for its probability density function:

$$ \phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2} \quad \text{for } -\infty < x < \infty. $$


Note that φ is symmetric around zero: φ(−x) = φ(x) for each x. The
corresponding distribution function is denoted by Φ. The table for the standard
normal distribution (see Table B.1) does not contain the values of Φ(a), but rather
the so-called right tail probabilities 1 − Φ(a). If, for instance, we want to know
the probability that a standard normal random variable Z is smaller than or
equal to 1, we use that P(Z ≤ 1) = 1 − P(Z > 1). In the table we find that
P(Z > 1) = 1 − Φ(1) is equal to 0.1587. Hence P(Z ≤ 1) = 1 − 0.1587 = 0.8413.
With the table you can handle tail probabilities with numbers a given to two
decimals. To find, for instance, P(Z > 1.07), we stay in the same row in the
table but move to the seventh column to find that P(Z > 1.07) = 0.1423.
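
As a cross-check on the table lookups, right tail probabilities can also be computed numerically. The sketch below is not part of the original text; it assumes Python's standard `math.erfc` and uses the identity 1 − Φ(a) = ½ · erfc(a/√2). The function names are illustrative.

```python
import math


def right_tail(a):
    """Right tail probability 1 - Phi(a) of the standard normal distribution."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))


def phi_cdf(a):
    """Distribution function Phi(a), obtained from the right tail."""
    return 1.0 - right_tail(a)
```

Rounded to four decimals, `right_tail(1.0)` and `right_tail(1.07)` reproduce the table values 0.1587 and 0.1423 quoted above.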


QUICK EXERCISE 5.5 Let the random variable Z have a standard normal
distribution. Use Table B.1 to find P(Z ≤ 0.75). How do you know, without
doing any calculations, that the answer should be larger than 0.5?


5.6 Quantiles 


Recall the chemical reactor example, where the residence time T, measured
in minutes, has an exponential distribution with parameter λ = v/V = 0.25.
As we shall see in the next chapters, a consequence of this choice of λ is that
the mean time the particle stays in the vessel is 4 minutes. However, from the
viewpoint of process control this is not the quantity of interest. Often, there
will be some minimal amount of time the particle has to stay in the vessel to
participate in the chemical reaction, and we would want that at least 90% of
the particles stay in the vessel this minimal amount of time. In other words,
we are interested in the number q with the property that P(T > q) = 0.9, or
equivalently,

$$ P(T \le q) = 0.1. $$


The number q is called the 0.1th quantile or 10th percentile of the distribution.
In the case at hand it is easy to determine. We should have

$$ P(T \le q) = 1 - e^{-0.25q} = 0.1. $$

This holds exactly when $e^{-0.25q} = 0.9$ or when −0.25q = ln(0.9) = −0.105.
So q = 0.42. Hence, although the mean residence time is 4 minutes, 10% of
the particles stay less than 0.42 minutes in the vessel, which is just slightly
more than 25 seconds! We use the following general definition.


DEFINITION. Let X be a continuous random variable and let p be a
number between 0 and 1. The pth quantile or 100pth percentile of
the distribution of X is the smallest number $q_p$ such that

$$ F(q_p) = P(X \le q_p) = p. $$

The median of a distribution is its 50th percentile.


QUICK EXERCISE 5.6 What is the median of the U(2,7) distribution? 


For continuous random variables $q_p$ is often easy to determine. Indeed, if F is
strictly increasing from 0 to 1 on some interval (which may be infinite to one
or both sides), then

$$ q_p = F^{\mathrm{inv}}(p), $$

where $F^{\mathrm{inv}}$ is the inverse of F. This is illustrated in Figure 5.7 for the
Exp(0.25) distribution.
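
For the exponential distribution the inverse can be written down explicitly by solving 1 − e^{−λq} = p for q. The sketch below illustrates this; the function names are assumptions made for the example, not notation from the text.

```python
import math


def exp_cdf(x, lam):
    """Distribution function of Exp(lam): 1 - e^(-lam * x) for x >= 0, else 0."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0


def exp_quantile(p, lam):
    """pth quantile of Exp(lam), i.e. the solution q of exp_cdf(q, lam) = p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return -math.log(1.0 - p) / lam
```

With λ = 0.25 and p = 0.1 this gives approximately 0.42, in agreement with the reactor computation earlier in this section.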


Fig. 5.7. The pth quantile $q_p$ of the Exp(0.25) distribution.


For an exponential distribution it is easy to compute quantiles. This is
different for the standard normal distribution, where we have to use a table
(like Table B.1). For example, the 90th percentile of a standard normal is the
number $q_{0.9}$ such that $\Phi(q_{0.9}) = 0.9$, which is the same as $1 - \Phi(q_{0.9}) = 0.1$,
and the table gives us $q_{0.9} = 1.28$. This is illustrated in Figure 5.8, with both
the probability density function and the distribution function of the standard
normal distribution.
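
Instead of a table, a numerical routine can supply normal quantiles. The following sketch assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist` with an `inv_cdf` method; the wrapper name is illustrative.

```python
from statistics import NormalDist


def standard_normal_quantile(p):
    """pth quantile of the N(0, 1) distribution, for 0 < p < 1."""
    return NormalDist(mu=0.0, sigma=1.0).inv_cdf(p)
```

Rounded to two decimals, `standard_normal_quantile(0.9)` agrees with the table value 1.28 found above.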


QUICK EXERCISE 5.7 Find the 0.95th quantile $q_{0.95}$ of a standard normal
distribution, accurate to two decimals.




Fig. 5.8. The 90th percentile of the N(0,1) distribution. 


5.7 Solutions to the quick exercises 


5.1 We know from integral calculus that for 0 ≤ a ≤ b ≤ 1

$$ \int_a^b f(x)\,dx = \int_a^b \frac{1}{2\sqrt{x}}\,dx = \sqrt{b} - \sqrt{a}. $$

Hence $\int_{-\infty}^{\infty} f(x)\,dx = \int_0^1 1/(2\sqrt{x})\,dx = 1$ (so f is a probability density
function, nonnegativity being obvious), and

$$ P(10^{-4} \le X \le 10^{-2}) = \int_{10^{-4}}^{10^{-2}} \frac{1}{2\sqrt{x}}\,dx = \sqrt{10^{-2}} - \sqrt{10^{-4}} = 10^{-1} - 10^{-2} = 0.09. $$

Actually, the random variable X arises in a natural way; see equation (7.1).


5.2 We have P(0 < X ≤ r/2) = F(r/2) − F(0) = (1/2)² − 0² = 1/4, and
P(r/2 < X ≤ r) = F(r) − F(r/2) = 1 − 1/4 = 3/4, no matter what the radius
of the disc is!


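5.3 Since f(x) = 0 for x < α, we have F(x) = 0 if x < α. Also, since f(x) = 0
for all x > β, F(x) = 1 if x > β. In between

$$ F(x) = \int_{-\infty}^{x} f(y)\,dy = \int_{\alpha}^{x} \frac{1}{\beta - \alpha}\,dy = \left[\frac{y}{\beta - \alpha}\right]_{\alpha}^{x} = \frac{x - \alpha}{\beta - \alpha}. $$

In other words, the distribution function increases linearly from the value 0
in α to the value 1 in β.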


5.4 If X is the response time, we ask for P(X > 5). This equals

$$ P(X > 5) = e^{-0.25 \cdot 5} = e^{-1.25} = 0.2865. $$


5.5 In the eighth row and sixth column of the table, we find that 1 − Φ(0.75) =
0.2266. Hence the answer is 1 − 0.2266 = 0.7734. Because of the symmetry of
the probability density φ, half of the mass of a standard normal distribution
lies on the negative axis. Hence for any number a > 0, it should be true that
P(Z ≤ a) > P(Z ≤ 0) = 0.5.


5.6 The median is the number $q_{0.5} = F^{\mathrm{inv}}(0.5)$. You either see directly that
you have got half of the mass to both sides of the middle of the interval, hence
$q_{0.5} = (2 + 7)/2 = 4.5$, or you solve with the distribution function:

$$ \frac{1}{2} = F(q) = \frac{q - 2}{7 - 2} \quad \text{and so} \quad q = 4.5. $$


5.7 Since $\Phi(q_{0.95}) = 0.95$ is the same as $1 - \Phi(q_{0.95}) = 0.05$, the table gives
us $q_{0.95} = 1.64$, or more precisely, if we interpolate between the fourth and
the fifth column, 1.645.


5.8 Exercises 


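5.1 Let X be a continuous random variable with probability density function

$$ f(x) = \begin{cases} \frac{3}{4} & \text{for } 0 \le x \le 1 \\ \frac{1}{4} & \text{for } 2 \le x \le 3 \\ 0 & \text{elsewhere.} \end{cases} $$
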


a. Draw the graph of f. 
b. Determine the distribution function F' of X, and draw its graph. 


5.2 © Let X be a random variable that takes values in [0,1], and is further 
given by 


$$ F(x) = x^2 \quad \text{for } 0 \le x \le 1. $$
Compute P(s <X< 3). 


5.3 Let a continuous random variable X be given that takes values in [0, 1], 
and whose distribution function F satisfies 


$$ F(x) = 2x^2 - x^4 \quad \text{for } 0 \le x \le 1. $$


a. Compute P( <X< 3). 
b. What is the probability density function of X? 


5.4 H Jensen, arriving at a bus stop, just misses the bus. Suppose that he 
decides to walk if the (next) bus takes longer than 5 minutes to arrive. Suppose 
also that the time in minutes between the arrivals of buses at the bus stop is 
a continuous random variable with a U(4,6) distribution. Let X be the time 
that Jensen will wait. 




a. What is the probability that X is less than 4½ (minutes)?
b. What is the probability that X equals 5 (minutes)? 
c. Is X a discrete random variable or a continuous random variable? 


5.5 LJ The probability density function f of a continuous random variable X
is given by:

$$ f(x) = \begin{cases} cx + 3 & \text{for } -3 \le x \le -2 \\ 3 - cx & \text{for } 2 \le x \le 3 \\ 0 & \text{elsewhere.} \end{cases} $$


a. Compute c. 
b. Compute the distribution function of X. 


5.6 Let X have an Exp(0.2) distribution. Compute P(X > 5). 


5.7 The score of a student on a certain exam is represented by a number 
between 0 and 1. Suppose that the student passes the exam if this number 
is at least 0.55. Suppose we model this experiment by a continuous random 
variable S, the score, whose probability density function is given by 


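$$ f(x) = \begin{cases} 4x & \text{for } 0 \le x \le \frac{1}{2} \\ 4 - 4x & \text{for } \frac{1}{2} \le x \le 1 \\ 0 & \text{elsewhere.} \end{cases} $$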


a. What is the probability that the student fails the exam? 
b. What is the score that he will obtain with a 50% chance, in other words, 
what is the 50th percentile of the score distribution? 


5.8 H Consider Quick exercise 5.2. For another dart thrower it is given that his 
distance to the center of the disc Y is described by the following distribution 


function:

$$ G(b) = \sqrt{\frac{b}{r}} \quad \text{for } 0 < b < r $$

and G(b) = 0 for b ≤ 0, G(b) = 1 for b ≥ r.


a. Sketch the probability density function $g(y) = \frac{d}{dy} G(y)$.

b. Is this person “better” than the person in Quick exercise 5.2? 

c. Sketch a distribution function associated to a person who in 90% of his 
throws hits the disc no further than 0.1-r of the center. 


5.9 EJ Suppose we choose arbitrarily a point from the square with corners at 
(2,1), (3,1), (2,2), and (3,2). The random variable A is the area of the triangle 
with its corners at (2,1), (3,1) and the chosen point (see Figure 5.9). 


a. What is the largest area A that can occur, and what is the set of points 
for which A < 1/4? 


Fig. 5.9. A triangle in a square.


b. Determine the distribution function F of A.

c. Determine the probability density function f of A. 

5.10 Consider again the chemical reactor example with parameter λ = 0.5.
We saw in Section 5.6 that 10% of the particles stay in the vessel no longer
than about 12 seconds, while the mean residence time is 2 minutes. Which
percentage of the particles stay no longer than 2 minutes in the vessel?


5.11 Compute the median of an Exp(λ) distribution.


5.12 FE] Compute the median of a Par(1) distribution. 


5.13 H We consider a random variable Z with a standard normal distribution. 


a. Show why the symmetry of the probability density function φ of Z implies
that for any a one has Φ(−a) = 1 − Φ(a).


b. Use this to compute P(Z ≤ −2).


5.14 Determine the 10th percentile of a standard normal distribution. 


6 


Simulation 


Sometimes probabilistic models are so complex that the tools of mathemat- 
ical analysis are not sufficient to answer all relevant questions about them. 
Stochastic simulation is an alternative approach: values are generated for the 
random variables and inserted into the model, thus mimicking outcomes for 
the whole system. It is shown in this chapter how one can use uniform ran- 
dom number generators to mimic random variables. Also two larger simulation 
examples are presented. 


6.1 What is simulation? 


In many areas of science, technology, government, and business, models are 
used to gain understanding of some part of reality (the portion of interest is 
often referred to as “the system”). Sometimes these are physical models, such 
as a scale model of an airplane in a wind tunnel or a scale model of a chemical 
plant. Other models are abstract, such as macroeconomic models consisting 
of equations relating things like interest rates, unemployment, and inflation 
or partial differential equations describing global weather patterns. 


In simulation, one uses a model to create specific situations in order to study 
the response of the model to them and then interprets this in terms of what 
would happen to the system “in the real world.” In this way, one can carry 
out experiments that are impossible, too dangerous, or too expensive to do 
in the real world—addressing questions like: What happens to the average 
temperature if we reduce the greenhouse gas emissions globally by 50%? Can 
the plane still fly if engines 3 and 4 stop in midair? What happens to the 
distribution of wealth if we halve the tax rate? 


More specifically, we focus on situations and problems where randomness or 
uncertainty or both play a significant or dominant role and should be modeled 
explicitly. Models for such systems involve random variables, and we speak of 
probabilistic or stochastic models. Simulating them is stochastic simulation. In
the preceding chapters we have encountered some of the tools of probability
theory, and we will encounter others in the chapters to come. With these tools 
we can compute quantities of interest explicitly for many models. Stochastic 
simulation of a system means generating values for all the random variables 
in the model, according to their specified distributions, and recording and 
analyzing what happens. We refer to the generated values as realizations of 
the random variables. 


For us, there are two reasons to learn about stochastic simulation. The first is 
that for complex systems, simulation can be an alternative to mathematical 
analysis, sometimes the only one. The second reason is that through simula- 
tion, we can get more feeling for random variables, and this is why we study 
stochastic simulation at this point in the book. We start by asking how we 
can generate a realization of a random variable. 


6.2 Generating realizations of random variables 


Simulations are almost always done using computers, which usually have one 
or more so-called (pseudo) random number generators. A call to the random 
number generator returns a random number between 0 and 1, which mimics 
a realization of a U(0,1) variable. With this source of uniform (pseudo) ran- 
domness we can construct any random variable we want by transforming the 
outcome, as we shall see. 


QUICK EXERCISE 6.1 Describe how you can simulate a coin toss when instead 
of a coin you have a die. Any ideas on how to simulate a roll of a die if you 
only have a coin? 


Bernoulli random variables 


Suppose U has a U(0,1) distribution. To construct a Ber(p) random variable
for some 0 < p < 1, we define

$$ X = \begin{cases} 1 & \text{if } U < p, \\ 0 & \text{if } U \ge p, \end{cases} $$

so that

$$ P(X = 1) = P(U < p) = p, \qquad P(X = 0) = P(U \ge p) = 1 - p. $$

This random variable X has a Bernoulli distribution with parameter p.
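
In code, this construction is a single comparison. The sketch below assumes Python's standard `random` module as the uniform random number generator; the function names, sample size, and seed are illustrative choices.

```python
import random


def bernoulli(p, rng):
    """Return 1 if a fresh U(0,1) draw falls below p, and 0 otherwise."""
    u = rng.random()  # pseudo-random number in [0, 1)
    return 1 if u < p else 0


def fraction_of_ones(p, n=20_000, seed=0):
    """Fraction of ones among n simulated Ber(p) realizations."""
    rng = random.Random(seed)
    return sum(bernoulli(p, rng) for _ in range(n)) / n
```

For a large number of draws, `fraction_of_ones(p)` should be close to p, up to simulation noise.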


QUICK EXERCISE 6.2 A random variable Y has outcomes 1, 3, and 4 with the 
following probabilities: P(Y =1) = 3/5, P(Y =3) = 1/5, and P(Y =4) = 
1/5. Describe how to construct Y from a U(0,1) random variable. 




Continuous random variables 


Suppose we have the distribution function F of a continuous random variable
and we wish to construct a random variable with this distribution. We show
how to do this if F is strictly increasing from 0 to 1 on an interval. In that
case F has an inverse function $F^{\mathrm{inv}}$. Figure 6.1 shows an example: F is strictly
increasing on the interval [2, 10]; the inverse $F^{\mathrm{inv}}$ is a function from the interval
[0, 1] to the interval [2, 10].


Fig. 6.1. Simulating a continuous random variable using the distribution function.


Note how u relates to $F^{\mathrm{inv}}(u)$ as F(x) relates to x. We see that u ≤ F(x)
is equivalent with $F^{\mathrm{inv}}(u) \le x$. If instead of a real number u we consider a
U(0,1) random variable U, we obtain that the corresponding events are the
same:

$$ \{U \le F(x)\} = \{F^{\mathrm{inv}}(U) \le x\}. \tag{6.1} $$

We know about the U(0,1) random variable U that P(U ≤ b) = b for any
number 0 ≤ b ≤ 1. Substituting b = F(x) we see

$$ P(U \le F(x)) = F(x). $$

From equality (6.1), therefore,

$$ P(F^{\mathrm{inv}}(U) \le x) = F(x); $$

in other words, the random variable $F^{\mathrm{inv}}(U)$ has distribution function F.


What remains is to find the function $F^{\mathrm{inv}}$. From Figure 6.1 we see

$$ F(x) = u \quad \Leftrightarrow \quad x = F^{\mathrm{inv}}(u), $$

so if we solve the equation F(x) = u for x, we obtain the expression for
$F^{\mathrm{inv}}(u)$.




Exponential random variables 


We apply this method to the exponential distribution. On the interval [0, ∞),
the Exp(λ) distribution function is strictly increasing and given by

$$ F(x) = 1 - e^{-\lambda x}. $$

To find $F^{\mathrm{inv}}$, we solve the equation F(x) = u:

$$ \begin{aligned} F(x) = u \quad &\Leftrightarrow \quad 1 - e^{-\lambda x} = u \\ &\Leftrightarrow \quad e^{-\lambda x} = 1 - u \\ &\Leftrightarrow \quad -\lambda x = \ln(1 - u) \\ &\Leftrightarrow \quad x = -\frac{1}{\lambda}\ln(1 - u), \end{aligned} $$

so $F^{\mathrm{inv}}(u) = -\frac{1}{\lambda}\ln(1 - u)$ and if U has a U(0,1) distribution, then the random
variable X defined by

$$ X = F^{\mathrm{inv}}(U) = -\frac{1}{\lambda}\ln(1 - U) $$

has an Exp(λ) distribution.


In practice, one replaces 1 − U with U, because both have a U(0,1) distribution
(see Exercise 6.3). Leaving out the subtraction leads to more efficient computer
code. So instead of X we may use

$$ Y = -\frac{1}{\lambda}\ln(U), $$

which also has an Exp(λ) distribution.
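
The inverse transform for the exponential distribution can be sketched as follows. The code assumes Python's standard `random` module as the source of U(0,1) numbers; the function names, sample size, and seed are illustrative. It keeps the form −ln(1 − U)/λ because `random()` returns values in [0, 1), so 1 − U is never zero and the logarithm is always defined.

```python
import math
import random


def exp_inverse(u, lam):
    """F_inv(u) = -ln(1 - u) / lam for 0 <= u < 1."""
    return -math.log(1.0 - u) / lam


def simulate_exp(lam, n, seed=0):
    """n simulated realizations of an Exp(lam) random variable."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), hence 1 - u lies in (0, 1].
    return [exp_inverse(rng.random(), lam) for _ in range(n)]
```

As a plausibility check, the fraction of simulated values exceeding t should be close to e^{−λt}; with λ = 0.5 and t = 2 this is about 0.37.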


QUICK EXERCISE 6.3 A distribution function F is 0 for x < 1 and 1 for x > 3,
and F(x) = ¼(x − 1)² if 1 ≤ x ≤ 3. Let U be a U(0,1) random variable.
Construct a random variable with distribution F from U.


Remark 6.1 (The general case). The restriction we imposed earlier, 
that the distribution function should be strictly increasing, is not really 
necessary. Furthermore, a distribution function with jumps or a flat section 
somewhere in the middle is not a problem either. We illustrate this with an 
example in Figure 6.2. 

This F has a jump at 4 and so for a corresponding X we should have 
P(X =4) = 0.2, the size of the jump. We see that whenever U is in the 
interval [0.3, 0.5], it is mapped to 4 by our method, and that this happens 
with exactly the right probability! 

The flat section of F between 7 and 8 seems to pose a problem: the
equation F(x) = 0.85 has as its solution any x between 7 and 8, and we
cannot define a unique inverse. This, however, does not really matter, because
P(U = 0.85) = 0, and we can define the inverse $F^{\mathrm{inv}}(0.85)$ in any way we
want. Taking the left endpoint, here the number 7, agrees best with the
definition of quantiles (see Section 5.6).


Fig. 6.2. A distribution function with a jump and a flat section.


Remark 6.2 (Existence of random variables). The previous remark
supplies a sketchy argument for the fact that any nondecreasing,
right-continuous function F, with $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$, is the
distribution function of some random variable.


Generating sequences 


For simulations we often want to generate realizations for a large number of
random variables. Random number generators have been designed with this
purpose in mind: each new call mimics a new U(0,1) random variable. The
sequence of numbers thus generated is considered as a realization of a sequence
of U(0,1) random variables $U_1, U_2, U_3, \ldots$ with the special property that the
events $\{U_i \le a_i\}$ are independent¹ for every choice of the $a_i$.


6.3 Comparing two jury rules 


At the Olympic Games there are several sports events that are judged by a 
jury, including gymnastics, figure skating, and ice dancing. During the 2002 
winter games a dispute arose concerning the gold medal in ice dancing: there 
were allegations that the Russian team had bribed a French jury member, 
thereby causing the Russian pair to win just ahead of the Canadians. We look 
into operating rules for juries, although we leave the effects of bribery to the 
exercises (Exercise 6.11). 


Suppose we have a jury of seven members, and for each performance each
juror assigns a grade. The seven grades are to be transformed into a final
score. Two rules to do this are under consideration, and we want to choose
the better one. For the first one, the highest and lowest scores are removed
and the final score is the average of the remaining five. For the second rule,
the scores are put in ascending order and the middle one is assigned as final
score. Before you continue reading, consider which rule is better and how you
can verify this.


¹ In Chapter 9 we return to the question of independence between random variables.


A probabilistic model 


For our investigation we assume that the scores the jurors assign deviate by
some random amount from the true or deserved score. We model the score
that juror i assigns when the performance deserves a score g by

$$ Y_i = g + Z_i, \quad \text{for } i = 1, \ldots, 7, \tag{6.2} $$

where $Z_1, \ldots, Z_7$ are random variables with values around zero. Let $h_1$ and
$h_2$ be functions implementing the two rules:

$$ \begin{aligned} h_1(y_1, \ldots, y_7) &= \text{average of the middle five of } y_1, \ldots, y_7, \\ h_2(y_1, \ldots, y_7) &= \text{middle value of } y_1, \ldots, y_7. \end{aligned} $$

We are interested in deviations from the deserved score g:

$$ \begin{aligned} T &= h_1(Y_1, \ldots, Y_7) - g, \\ M &= h_2(Y_1, \ldots, Y_7) - g. \end{aligned} \tag{6.3} $$

The distributions of T and M depend on the individual jury grades, and
through those, on the juror-deviations $Z_1, Z_2, \ldots, Z_7$, which we model as
U(−0.5, 0.5) variables. This more or less finishes the modeling phase: we have
given a stochastic model that mimics the workings of a jury and have defined,
in terms of the variables in the model, the random variables T and M that
represent the errors that result after application of the jury rules.


In any serious application, the model should be validated. This means that 
one tries to gather evidence to convince oneself and others that the model 
adequately reflects the workings of the real system. In this chapter we are 
more interested in showing what you can do with simulation once you have a 
model, so we skip the validation. 


The next phase is analysis: which of the deviations is closer to zero? Because 
T and M are random variables, we would have to clarify what we mean by 
that, and answering the question certainly involves computing probabilities 
about T and M. We cannot do this with what we have learned so far, but we 
know how to simulate, so this is what we do. 


Simulation 


To generate a realization of a U(−0.5, 0.5) random variable, we only need to
subtract 0.5 from the result we obtain from a call to the random number generator.
We do this 7 times and insert the resulting values in (6.2) as jury deviations
$Z_1, \ldots, Z_7$, and substitute them in equations (6.3) to obtain T and M (the
value of g is irrelevant: it drops out of the calculation):

$$ \begin{aligned} T &= \text{average of the middle five of } Z_1, \ldots, Z_7, \\ M &= \text{middle value of } Z_1, \ldots, Z_7. \end{aligned} \tag{6.4} $$

In simulation terminology, this is called a run: we have gone through the whole
procedure once, inserting realizations for the random variables. If we repeat
the whole procedure, we have a second run; see Table 6.1 for the results of
five runs.
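
A single run can be sketched in a few lines. The code below is an illustration under assumptions: it uses Python's standard `random` module as the random number generator, and the function names are hypothetical stand-ins for the rules h₁ and h₂.

```python
import random


def rule_trimmed_mean(scores):
    """Rule 1: drop the highest and lowest score, average the remaining ones."""
    middle = sorted(scores)[1:-1]
    return sum(middle) / len(middle)


def rule_median(scores):
    """Rule 2: the middle value of an odd number of scores."""
    ordered = sorted(scores)
    return ordered[len(ordered) // 2]


def one_run(rng):
    """One run: seven U(-0.5, 0.5) juror deviations and the resulting (T, M)."""
    z = [rng.random() - 0.5 for _ in range(7)]
    return z, rule_trimmed_mean(z), rule_median(z)
```

Applied to the deviations −0.45, −0.08, −0.38, 0.11, −0.42, 0.48, 0.02, the two rules give −0.15 and −0.08, respectively.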


Table 6.1. Simulation results for the two jury rules.


Run    Z1      Z2      Z3      Z4      Z5      Z6      Z7      T       M
1     −0.45   −0.08   −0.38    0.11   −0.42    0.48    0.02   −0.15   −0.08
2     −0.37   −0.18    0.05   −0.10    0.01    0.28    0.31    0.01    0.01
3      0.08    0.07   −0.47   −0.21   −0.33   −0.22    0.48   −0.12   −0.21
4      0.24    0.08   −0.11    0.19   −0.03    0.02    0.44    0.10    0.08
5      0.10    0.18   −0.39   −0.24   −0.36   −0.25    0.20   −0.11   −0.24


QUICK EXERCISE 6.4 The next realizations for Z1,..., Z7 are: —0.05, 0.26, 
0.25, 0.39, 0.22, 0.23, 0.13. Determine the corresponding realizations of T 
and M. 


Table 6.1 can be used to check some computations. We also see that the real- 
ization of T was closest to zero in runs 3 and 5, the realization of M was closest 
to zero in runs 1 and 4, and they were (about) the same in run 2. There is no 
clear conclusion from this, and even if there was, one could wonder whether 
the next five runs would yield the same picture. Because the whole process 
mimics randomness, one has to expect some variation—or perhaps a lot. In 
later chapters we will get a better understanding of this variation; for the 
moment we just say that judgment based on a large number of runs is better. 
We do one thousand runs and exchange the table for pictures. Figure 6.3 de- 
picts, for juror 1, a histogram of all the deviations from the true score g. For 
each interval of length 0.05 we have counted the number of runs for which the 
deviation of juror 1 fell in that interval. These numbers vary from about 40 
to about 60. 


This is just to get an idea about the results for an individual juror. In
Figure 6.4 we see histograms for the final scores. Comparing the histograms, it
seems that the realizations of T are more concentrated near zero than those
of M.


78 6 Simulation 


60 


=. | a a eo 
—0.4 —0.2 0.0 0.2 0.4 


Fig. 6.3. Deviations of juror 1 from the deserved score, one thousand runs. 


Fig. 6.4. One thousand realizations of T (left panel) and M (right panel).


However, the two histograms do not tell us anything about the relation be- 
tween T and M, so we plot the realizations of pairs (T, M) for all one thousand 
runs (Figure 6.5). From this plot we see that in most cases M and T go in 
the same direction: if T is positive, then usually M is also positive, and the 
same goes for negative values. In terms of the final scores, both rules generally 
overvalue and undervalue the performance simultaneously. On closer exami- 
nation, with help of the line drawn from (—0.5, —0.5) to (0.5, 0.5), we see that 
the T values tend to be a little closer to zero than the M values. 

This suggests that we make a histogram that shows the difference of the
absolute deviations from the true score. For rule 1 this absolute deviation is |T|,
for rule 2 it is |M|. If the difference |M| - |T| is positive, then T is closer to
zero than M, and the difference tells us by how much. A negative difference


Fig. 6.5. Plot of the points (T, M), one thousand runs.


means that M was closer. In Figure 6.6 all the differences are shown in a 
histogram. The bars to the right of zero represent 696 runs. So, in about 70% 
of the runs, rule 1 resulted in a final score that is closer to the true score than 
rule 2. In about 30% of the cases, rule 2 was better, but generally by a smaller 
amount, as we see from the histogram. 
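
The comparison of the two rules can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not the program behind the figures: the seven deviations are assumed to be independent U(-0.5, 0.5) variables (suggested by the flat histogram of Figure 6.3), T is taken to be the average that remains after discarding the lowest and the highest value, and M the middle value, which agrees with the solution to Quick exercise 6.4. The function names and the seed are illustrative.

```python
import random
import statistics


def trimmed_mean(z):
    """Rule 1 (T): drop the lowest and highest value, average the rest."""
    s = sorted(z)
    return sum(s[1:-1]) / (len(s) - 2)


def median_rule(z):
    """Rule 2 (M): the middle value of the sorted deviations."""
    return statistics.median(z)


def compare_rules(runs=1000, jurors=7, seed=1):
    """Count the runs in which |M| - |T| > 0, i.e. T is closer to zero."""
    rng = random.Random(seed)
    t_closer = 0
    for _ in range(runs):
        # Assumed model: each juror's deviation is U(-0.5, 0.5).
        z = [rng.uniform(-0.5, 0.5) for _ in range(jurors)]
        if abs(median_rule(z)) - abs(trimmed_mean(z)) > 0:
            t_closer += 1
    return t_closer


if __name__ == "__main__":
    print(compare_rules(), "of 1000 runs had T closer to zero than M")
```

The count will differ from run to run and from the 696 reported above, since it depends on the random numbers drawn.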


Fig. 6.6. Differences |M| - |T| for one thousand runs.




6.4 The single-server queue 


There are many situations in life where you stand in a line waiting for some 
service: when you want to withdraw money from a cash dispenser, borrow 
books at the library, be admitted to the emergency room at the hospital, or 
pump gas at the gas station. Many other queueing situations are hidden: an 
email message you send might be queued at the local server until it has sent 
all messages that were submitted ahead of yours; searching the Internet, your 
browser sends and receives packets of information that are queued at various 
stages and locations; in assembly lines, partly finished products move from 
station to station, each time waiting for the next component to be added. 


We are going to study one simple queueing model, the so-called single-server 
queue: it has one server or service mechanism, and the arriving customers 
await their turn in order of their arrival. For definiteness, think of an oasis 
with one big water well. People arrive at the well with bottles, jerry cans, and 
other types of containers, to pump water. The supply of water is large, but 
the pump capacity is limited. The pump is about to be replaced, and while it 
is clear that a larger pump capacity will result in shorter waiting times, more 
powerful pumps are also more expensive. Therefore, to prepare a decision that 
balances costs and benefits, we wish to investigate the relationship between 
pump capacity and system performance. 


Modeling the system 


A stochastic model is in order: some general characteristics are known, such 
as how many people arrive per day and how much water they take on average, 
but the individual arrival times and amounts are unpredictable. We introduce 
random variables to describe them: let $T_1$ be the time between the start at
time zero and the arrival of the first customer, $T_2$ the time between the arrivals
of the first and the second customer, $T_3$ the time between the second and the
third, etc.; these are called the interarrival times. Let $S_i$ be the length of time
that customer $i$ needs to use the pump; in standard terminology this is called
the service time. This is our description so far:


Arrivals at:     $T_1$,   $T_1 + T_2$,   $T_1 + T_2 + T_3$,   etc.
Service times:   $S_1$,   $S_2$,   $S_3$,   etc.


The pump capacity $v$ (liters per minute) is not a random variable but a model
parameter or decision variable, whose “best” value we wish to determine. So
if customer $i$ requires $R_i$ liters of water, then her service time is

\[ S_i = \frac{R_i}{v}. \]

To complete the model description, we need to specify the distribution of the
random variables $T_i$ and $R_i$:




Interarrival times: every $T_i$ has an Exp(0.5) distribution (minutes);
Service requirement: every $R_i$ has a U(2,5) distribution (liters).


This particular choice of distributions would have to be supported by evidence 
that they are suited for the system at hand: a validation step as suggested for 
the jury model is appropriate here as well. For many arrival type processes, 
however, the exponential distribution is reasonable as a model for the inter- 
arrival times (see Chapter 12). The particular uniform distribution chosen for 
the required amount of water says that all amounts between 2 and 5 liters are 
equally likely. So there is no sheik who owns a 5000-liter water truck in “our” 
oasis. 

To evaluate system performance, we want to extract from the model the wait- 
ing times of the customers and how busy it is at the pump. 


Waiting times 


Let $W_i$ denote the waiting time of customer $i$. The first customer is lucky;
the system starts empty, and so $W_1 = 0$. For customer $i$ the waiting time
depends on how long customer $i-1$ spent in the system compared to the time
between their respective arrivals. We see that if the interarrival time $T_i$ is long,
relatively speaking, then customer $i$ arrives after the departure of customer
$i-1$, and so $W_i = 0$:


[Timeline: arrival of customer $i-1$; departure of customer $i-1$ after a time $W_{i-1} + S_{i-1}$; arrival of customer $i$ after a time $T_i$, later than the departure; hence $W_i = 0$.]


On the other hand, if customer $i$ arrives before the departure, the waiting
time $W_i$ equals whatever remains of $W_{i-1} + S_{i-1}$:


[Timeline: arrival of customer $i-1$; arrival of customer $i$ after a time $T_i$; departure of customer $i-1$ after a time $W_{i-1} + S_{i-1}$; hence $W_i = W_{i-1} + S_{i-1} - T_i$.]


Summarizing the two cases, we obtain:

\[ W_i = \max\{W_{i-1} + S_{i-1} - T_i,\ 0\}. \qquad (6.5) \]


To carry out a simulation, we start at time zero and generate realizations of
the interarrival times (the $T_i$) and service requirements (the $R_i$) for as long
as we want, computing the other quantities that follow from the model on the
way. Table 6.2 shows the values generated this way, for two pump capacities
($v = 2$ and 3) for the first six customers. Note that in both cases we use the
same realizations of $T_i$ and $R_i$.
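
The recursion (6.5) is easy to put into code. The following Python sketch takes the service time of customer i to be R_i / v and feeds in the six realizations of the interarrival times and service requirements listed in Table 6.2; the function name `waiting_times` is ours. Because the tabulated inputs are rounded to two decimals, the computed waiting times can differ from the table in the last digit.

```python
def waiting_times(interarrival, liters, v):
    """Waiting times via W_i = max(W_{i-1} + S_{i-1} - T_i, 0), with W_1 = 0."""
    service = [r / v for r in liters]  # S_i = R_i / v
    waits = [0.0]                      # the first customer never waits
    for i in range(1, len(interarrival)):
        left_over = waits[i - 1] + service[i - 1] - interarrival[i]
        waits.append(max(left_over, 0.0))
    return waits


# Realizations of T_i and R_i for the first six customers (Table 6.2).
T = [0.24, 1.97, 1.73, 2.82, 1.01, 1.09]
R = [4.39, 4.00, 2.33, 4.03, 4.17, 4.24]

for v in (2, 3):
    print(v, [round(w, 2) for w in waiting_times(T, R, v)])
```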


Table 6.2. Results of a short simulation. 


       Input realizations          v = 2           v = 3
i    T_i    Arr.time   R_i       S_i    W_i      S_i    W_i

1    0.24    0.24      4.39      2.20   0        1.46   0
2    1.97    2.21      4.00      2.00   0.23     1.33   0
3    1.73    3.94      2.33      1.17   0.50     0.78   0
4    2.82    6.76      4.03      2.01   0        1.34   0
5    1.01    7.77      4.17      2.09   1.00     1.39   0.33
6    1.09    8.86      4.24      2.12   1.99     1.41   0.63


QUICK EXERCISE 6.5 The next four realizations are $T_7$: 1.86; $R_7$: 4.79; $T_8$:
1.08; and $R_8$: 2.33. Complete the corresponding rows of the table.


Longer simulations produce so many numbers that we will drown in them 
unless we think of something. First, we summarize the waiting times of the 
first $n$ customers with their average:

\[ \bar{W}_n = \frac{W_1 + W_2 + \cdots + W_n}{n}. \qquad (6.6) \]

Then, instead of giving a table, we plot the pairs $(n, \bar{W}_n)$, for $n = 1, 2, \ldots$ until
the end of the simulation. In Figure 6.7 we see that both lines bounce up and 
down a bit. Toward the end, the average waiting time for pump capacity 3 is 
about 0.5 and for v = 2 about 2. In a longer simulation we would see each of 
the averages converge to a limiting value (a consequence of the so-called law 
of large numbers, the topic of Chapter 13). 
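
A longer run of this kind might be sketched as follows. The Python code below draws interarrival times from Exp(0.5) and service requirements from U(2, 5) with the standard `random` module and returns the running averages of (6.6). The function name, the seed, and the run length are illustrative choices; reusing one seed gives both pump capacities the same realizations, as in the text.

```python
import random


def running_average_waits(n, v, seed=0):
    """Return the running averages of the first 1, 2, ..., n waiting times."""
    rng = random.Random(seed)
    averages = []
    total = 0.0
    wait = 0.0           # W_1 = 0: the system starts empty
    prev_service = 0.0
    for i in range(n):
        t = rng.expovariate(0.5)    # interarrival time, Exp(0.5), mean 2 minutes
        r = rng.uniform(2.0, 5.0)   # liters required, U(2, 5)
        if i > 0:
            wait = max(wait + prev_service - t, 0.0)   # recursion (6.5)
        prev_service = r / v        # service time of the current customer
        total += wait
        averages.append(total / (i + 1))
    return averages


slow = running_average_waits(500, v=2)
fast = running_average_waits(500, v=3)
print(round(slow[-1], 2), round(fast[-1], 2))
```

With identical inputs, every waiting time at capacity 3 is at most the corresponding waiting time at capacity 2, so the second running average never exceeds the first.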


2.5 
2.0 
1.5 
1.0 


0.5 


Average of first n waiting times 


0.0 


Fig. 6.7. Averaged waiting times at the well, for pump capacity 2 and 3. 


Work-in-system 


To show how busy it is at the pump one could record how many customers are 
waiting in the queue and plot this quantity against time. A slightly different 
approach is to record at every moment how much work there is in the system, 
that is, how much time it would take to serve everyone present at that moment. 
For example, if I am halfway through filling my 4-liter jerry can and three 
persons are waiting who require 2, 3, and 5 liters, respectively, then there are 
12 liters to go; at v = 2, there is 6 minutes of work in the system, and at 
v = 3 just 4. 

The amount of work in the system just before a customer arrives equals the 
waiting time of that customer, because it is exactly the time it takes to finish 
the work for everybody ahead of her. The work-in-system at time t tells us
how long the wait would be if somebody were to arrive at t. For this reason,
this quantity is also called the virtual waiting time. 


Figure 6.8 shows the work-in-system as a function of time for the first 15 
minutes, using the same realizations that were the basis for Table 6.2. In the 
top graph, corresponding to v = 2, the work in the system jumps to 2.20
(which is the realization of $R_1/2$) at t = 0.24, when the first customer arrives.
So at t = 2.21, which is 1.97 later, there is 2.20 - 1.97 = 0.23 minute of work
left; this is the waiting time for customer 2, who brings an amount of work
of 2.00 minutes, so the peak at t = 2.21 is 0.23 + 2.00 = 2.23, etc. In the bottom
graph we see the work-in-system reach zero more often, because the individual 
(work) amounts are 2/3 of what they are when v = 2. More often, arriving 


Fig. 6.8. Work in system: top, v = 2; bottom, v = 3.


Fig. 6.9. Work in system: top, v = 2; bottom, v = 3.


customers find the queue empty and the pump not in use; they do not have 
to wait. 

In Figure 6.9 the work-in-system is depicted as a function of time for the 
first 100 minutes of our run. At pump capacity 2 the virtual waiting time 
peaks at close to 11 minutes after about 55 minutes, whereas with v = 3 the 
corresponding peak is only about 4 minutes. There also is a marked difference 
in the proportion of time the system is empty. 


6.5 Solutions to the quick exercises 


6.1 To simulate the coin, choose any three of the six possible outcomes of 
the die, report heads if one of these three outcomes turns up, and report tails 
otherwise. For example, heads if the outcome is odd, tails if it is even. 


To simulate the die using a coin is more difficult; one solution is as follows. 
Toss the coin three times and use the following conversion table to map the 
result: 


Coins HHH HHT HTH HTT THH THT 
Die 1 2 3 4 5 6 


Repeat the coin tosses if you get TTH or TTT. 
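
As an illustration, the conversion table and the "toss again" rule might be coded as follows in Python. The fair coin is itself simulated here with the standard `random` module, and the names `TABLE` and `die_from_coin` are ours.

```python
import random

# Conversion table from the text; TTH and TTT are absent and mean "toss again".
TABLE = {"HHH": 1, "HHT": 2, "HTH": 3, "HTT": 4, "THH": 5, "THT": 6}


def die_from_coin(rng):
    """Toss three fair coins until the result appears in TABLE."""
    while True:
        tosses = "".join(rng.choice("HT") for _ in range(3))
        if tosses in TABLE:
            return TABLE[tosses]


rng = random.Random(2024)
rolls = [die_from_coin(rng) for _ in range(6000)]
print({face: rolls.count(face) for face in range(1, 7)})
```

Each of the six accepted triples has the same probability, so the six faces should appear roughly equally often.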




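6.2 Let the U(0,1) variable be U and set:

\[ Y = \begin{cases} 1 & \text{if } U < \tfrac{3}{5}, \\ 3 & \text{if } \tfrac{3}{5} \le U \le \tfrac{4}{5}, \\ 4 & \text{if } U > \tfrac{4}{5}. \end{cases} \]

So, for example, $P(Y = 3) = P(\tfrac{3}{5} \le U \le \tfrac{4}{5}) = \tfrac{1}{5}$.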


6.3 The given distribution function $F$ is strictly increasing between 1 and 3,
so we use the method with $F^{\mathrm{inv}}$. Solve the equation $F(x) = \tfrac{1}{4}(x-1)^2 = u$
for $x$. This yields $x = 1 + 2\sqrt{u}$, so we can set $X = 1 + 2\sqrt{U}$. If you need to
be convinced, determine $F_X$.


6.4 In ascending order the values are —0.05, 0.13, 0.22, 0.23, 0.25, 0.26, 0.39, 
so for M we find 0.23, and for T (0.13 + 0.22 + 0.23 + 0.25 + 0.26)/5 = 0.22. 


6.5 We find: 


       Input realizations          v = 2           v = 3
i    T_i    Arr.time   R_i       S_i    W_i      S_i    W_i

7    1.86   10.72      4.79      2.39   2.25     1.60   0.18
8    1.08   11.80      2.33      1.16   3.57     0.78   0.70


6.6 Exercises 


6.1 Let U have a U(0,1) distribution. 


a. Describe how to simulate the outcome of a roll with a die using U. 


b. Define Y as follows: round 6U + 1 down to the nearest integer. What are 
the possible outcomes of Y and their probabilities? 


6.2 We simulate the random variable $X = 1 + 2\sqrt{U}$ constructed in Quick
exercise 6.3. As realization for U we obtain from the pseudo random generator
the number u = 0.3782739.


a. What is the corresponding realization x of the random variable X?

b. If the next call to the random generator yields u = 0.3, will the corre- 
sponding realization for X be larger or smaller than the value you found 
in a? 

c. What is the probability the next draw will be smaller than the value you 
found in a? 




6.3 Let U have a U(0,1) distribution. Show that Z = 1 — U has a U(0,1) 
distribution by deriving the probability density function or the distribution 
function. 


6.4 Let $F$ be the distribution function as given in Quick exercise 6.3: $F(x)$
is 0 for $x < 1$ and 1 for $x > 3$, and $F(x) = \tfrac{1}{4}(x-1)^2$ if $1 \le x \le 3$. In the
answer it is claimed that $X = 1 + 2\sqrt{U}$ has distribution function $F$, where U
is a U(0,1) random variable. Verify this by computing $P(X \le a)$ and checking
that this equals $F(a)$, for any $a$.


6.5 We have seen that if U has a U(0,1) distribution, then $X = -\ln U$ has
an Exp(1) distribution. Check this by verifying that $P(X \le a) = 1 - e^{-a}$ for
$a \ge 0$.


6.6 Somebody messed up the random number generator in your computer:
instead of uniform random numbers it generates numbers with an Exp(2) dis- 
tribution. Describe how to construct a U(0,1) random variable U from an 
Exp(2) distributed X. 

Hint: look at how you obtain an Exp(2) random variable from a U(0,1) ran- 
dom variable. 


6.7 In models for the lifetimes of mechanical components one sometimes
uses random variables with distribution functions from the so-called Weibull
family. Here is an example: $F(x) = 0$ for $x < 0$, and

\[ F(x) = 1 - e^{-5x^2} \quad \text{for } x \ge 0. \]

Construct a random variable Z with this distribution from a U(0,1) variable.


6.8 A random variable X has a Par(3) distribution, so with distribution function
F with $F(x) = 0$ for $x < 1$, and $F(x) = 1 - x^{-3}$ for $x \ge 1$. For details on
the Pareto distribution see Section 5.4. Describe how to construct X from a 
U(0, 1) random variable. 


6.9 In Quick exercise 6.1 we simulated a die by tossing three coins. Recall
that we might need several attempts before succeeding. 


a. What is the probability that we succeed on the first try? 


b. Let N be the number of tries that we need. Determine the distribution 
of N. 


6.10 There is usually more than one way to simulate a particular random
variable. In this exercise we consider two ways to generate geometric random 
variables. 


a. We give you a sequence of independent U(0,1) random variables U1, U2, 
.... From this sequence, construct a sequence of Bernoulli random vari- 




ables. From the sequence of Bernoulli random variables, construct a (sin- 
gle) Geo(p) random variable. 


b. It is possible to generate a Geo(p) random variable using just one U(0, 1) 
random variable. If calls to the random number generator take a lot of 
CPU time, this would lead to faster simulation programs. Set $\lambda = -\ln(1-p)$
and let Y have an Exp($\lambda$) distribution. We obtain Z from Y by rounding
to the nearest integer greater than Y. Note that Z is a discrete random 
variable, whereas Y is a continuous one. Show that, nevertheless, the event 
{Z > n} is the same as {Y > n}. Use this to compute P(Z > n) from the 
distribution of Y. What is the distribution of Z? (See Quick exercise 4.6.) 


6.11 Reconsider the jury example (see Section 6.3). Suppose the first jury 
member is bribed to vote in favor of the present candidate. 


a. How should you now model $Y_1$? Describe how you can investigate which
of the two rules is less sensitive to the effect of the bribery. 


b. The International Skating Union decided to adopt a rule similar to the 
following: randomly discard two of the jury scores, then average the re- 
maining scores. Describe how to investigate this rule. Do you expect this 
rule to be more sensitive to the bribery than the two rules already dis- 
cussed, or less sensitive? 


6.12 A tiny financial model. To investigate investment strategies, consider
the following:


You can choose to invest your money in one particular stock or put it in a 
savings account. Your initial capital is € 1000. The interest rate r is 0.5% per 
month and does not change. The initial stock price is € 100. Your stochastic 
model for the stock price is as follows: next month the price is the same as 
this month with probability 1/2, with probability 1/4 it is 5% lower, and with 
probability 1/4 it is 5% higher. This principle applies for every new month. 
There are no transaction costs when you buy or sell stock. 

Your investment strategy for the next 5 years is: convert all your money to 
stock when the price drops below € 95, and sell all stock and put the money 
in the bank when the stock price exceeds € 110. 


Describe how to simulate the results of this strategy for the model given. 


6.13 We give you an unfair coin and you do not know P(H) for this coin. Can
you simulate a fair coin, and how many tosses do you need for each fair coin
toss?


7


Expectation and variance 


Random variables are complicated objects, containing a lot of information 
on the experiments that are modeled by them. If we want to summarize a 
random variable by a single number, then this number should undoubtedly 
be its expected value. The expected value, also called the expectation or mean, 
gives the center—in the sense of average value—of the distribution of the 
random variable. If we allow a second number to describe the random variable, 
then we look at its variance, which is a measure of spread of the distribution 
of the random variable. 


7.1 Expected values 


An oil company needs drill bits in an exploration project. Suppose that it is 
known that (after rounding to the nearest hour) drill bits of the type used 
in this particular project will last 2, 3, or 4 hours with probabilities 0.1, 0.7, 
and 0.2. If a drill bit is replaced by one of the same type each time it has worn 
out, how long could exploration be continued if in total the company would 
reserve 10 drill bits for the exploration job? What most people would do to 
answer this question is to take the weighted average 


\[ 0.1 \cdot 2 + 0.7 \cdot 3 + 0.2 \cdot 4 = 3.1, \]


and conclude that the exploration could continue for 10 x 3.1, or 31 hours. 
This weighted average is what we call the expected value or expectation of the 
random variable X whose distribution is given by 


\[ P(X = 2) = 0.1, \quad P(X = 3) = 0.7, \quad P(X = 4) = 0.2. \]


It might happen that the company is unlucky and that each of the 10 drill bits 
has worn out after two hours, in which case exploration ends after 20 hours. 
At the other extreme, they may be lucky and drill for 40 hours on these 10 




bits. However, it is a mathematical fact that the conclusion about a 31-hour 
total drilling time is correct in the following sense: for a large number n of 
drill bits the total running time will be around n times 3.1 hours with high 
probability. In the example, where n = 10, we have, for instance, that drilling 
will continue for 29, 30, 31, 32, or 33 hours with probability more than 0.86, 
while the probability that it will last only for 20, 21, 22, 23, or 24 hours is less 
than 0.00006. We will come back to this in Chapters 13 and 14. This example 
illustrates the following definition. 


DEFINITION. The expectation of a discrete random variable X taking
the values $a_1, a_2, \ldots$ and with probability mass function $p$ is the
number

\[ \mathrm{E}[X] = \sum_i a_i\, P(X = a_i) = \sum_i a_i\, p(a_i). \]


We also call E[X] the expected value or mean of X. Since the expectation is 
determined by the probability distribution of X only, we also speak of the 
expectation or mean of the distribution. 
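
The numbers quoted for the drill bits can be checked with a short computation. The Python sketch below evaluates the weighted average from the definition and then, assuming that the 10 lifetimes are independent, builds the exact distribution of their sum by repeated convolution. The helper name `pmf_of_sum` is ours.

```python
# Distribution of the lifetime X of one drill bit (hours -> probability).
pmf = {2: 0.1, 3: 0.7, 4: 0.2}

expectation = sum(a * p for a, p in pmf.items())


def pmf_of_sum(pmf, n):
    """Exact distribution of the sum of n independent copies, by convolution."""
    total = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for s, ps in total.items():
            for a, pa in pmf.items():
                nxt[s + a] = nxt.get(s + a, 0.0) + ps * pa
        total = nxt
    return total


total10 = pmf_of_sum(pmf, 10)
p_near_31 = sum(total10.get(h, 0.0) for h in range(29, 34))   # 29, ..., 33 hours
p_20_to_24 = sum(total10.get(h, 0.0) for h in range(20, 25))  # 20, ..., 24 hours
print(expectation, p_near_31, p_20_to_24)
```

The two probabilities printed should be consistent with the bounds 0.86 and 0.00006 mentioned in the example.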


QUICK EXERCISE 7.1 Let X be the discrete random variable that takes the 
values 1, 2, 4, 8, and 16, each with probability 1/5. Compute the expectation 
of X. 


Looking at an expectation as a weighted average gives a more physical
interpretation of this notion, namely as the center of gravity of weights $p(a_i)$
placed at the points $a_i$. For the random variable associated with the drill bit,
this is illustrated in Figure 7.1. 


Fig. 7.1. Expected value as center of gravity.




This point of view also leads the way to how one should define the expected
value of a continuous random variable. Let, for example, X be a continuous
random variable whose probability density function $f$ is zero outside the
interval [0,1]. It seems reasonable to approximate X by the discrete random
variable Y, taking the values

\[ \frac{1}{n},\ \frac{2}{n},\ \ldots,\ \frac{n-1}{n},\ 1 \]

with as probabilities the masses that X assigns to the intervals $[\frac{k-1}{n}, \frac{k}{n}]$:

\[ P\Big(Y = \frac{k}{n}\Big) = P\Big(\frac{k-1}{n} \le X \le \frac{k}{n}\Big) = \int_{(k-1)/n}^{k/n} f(x)\, dx. \]

We have a good idea of the size of this probability. For large n, it can be
approximated well in terms of f:

\[ P\Big(Y = \frac{k}{n}\Big) = \int_{(k-1)/n}^{k/n} f(x)\, dx \approx \frac{1}{n} f\Big(\frac{k}{n}\Big). \]

The “center-of-gravity” interpretation suggests that the expectation E[Y] of
Y should approximate the expectation E[X] of X. We have

\[ \mathrm{E}[Y] = \sum_{k=1}^{n} \frac{k}{n}\, P\Big(Y = \frac{k}{n}\Big) \approx \sum_{k=1}^{n} \frac{k}{n}\, f\Big(\frac{k}{n}\Big) \frac{1}{n}. \]

By the definition of a definite integral, for large n the right-hand side is close
to

\[ \int_0^1 x f(x)\, dx. \]


This motivates the following definition. 


DEFINITION. The expectation of a continuous random variable X
with probability density function $f$ is the number

\[ \mathrm{E}[X] = \int_{-\infty}^{\infty} x f(x)\, dx. \]


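We also call E[X] the expected value or mean of X. Note that E[X] is indeed
the center of gravity of the mass distribution described by the function $f$:

\[ \mathrm{E}[X] = \int_{-\infty}^{\infty} x f(x)\, dx = \frac{\int_{-\infty}^{\infty} x f(x)\, dx}{\int_{-\infty}^{\infty} f(x)\, dx}. \]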


This is illustrated in Figure 7.2. 




Fig. 7.2. Expected value as center of gravity, continuous case. 
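
The discretization argument that motivated the definition can be tried out numerically. In the Python sketch below the density is an illustrative choice of ours, not one from the text: f(x) = 2x on [0, 1], so that F(x) = x^2 and E[X] = 2/3. The function computes E[Y] for the discrete variable Y that takes the value k/n with the probability that X falls between (k-1)/n and k/n.

```python
def expectation_of_discretization(cdf, n):
    """E[Y] for Y taking the value k/n with probability F(k/n) - F((k-1)/n)."""
    return sum((k / n) * (cdf(k / n) - cdf((k - 1) / n)) for k in range(1, n + 1))


# Illustrative choice (not from the text): density f(x) = 2x on [0, 1],
# so that F(x) = x**2 and E[X] = 2/3.
def cdf(x):
    return x * x


for n in (10, 100, 1000):
    print(n, expectation_of_discretization(cdf, n))
```

As n grows, the printed values should approach 2/3, the center of gravity of this density.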


QUICK EXERCISE 7.2 Compute the expectation of a random variable U that 
is uniformly distributed over [2, 5]. 


Remark 7.1 (The expected value may not exist!). In the definitions 
in this section we have been rather careless about the convergence of sums 
and integrals. Let us take a closer look at the integral $I = \int_{-\infty}^{\infty} x f(x)\, dx$.
Since a probability density function cannot take negative values, we have
$I = I^- + I^+$ with $I^- = \int_{-\infty}^{0} x f(x)\, dx$ a negative and $I^+ = \int_{0}^{\infty} x f(x)\, dx$ a
positive number. However, it may happen that $I^-$ equals $-\infty$ or $I^+$ equals
$+\infty$. If both $I^- = -\infty$ and $I^+ = +\infty$, then we say that the expected
value does not exist. An example of a continuous random variable for which
the expected value does not exist is the random variable with the Cauchy
distribution (see also page 161), having probability density function

\[ f(x) = \frac{1}{\pi(1+x^2)} \quad \text{for } -\infty < x < \infty. \]

For this random variable

\[ I^- = \int_{-\infty}^{0} x \cdot \frac{1}{\pi(1+x^2)}\, dx = \Big[ \frac{1}{2\pi} \ln(1+x^2) \Big]_{-\infty}^{0} = -\infty, \]

\[ I^+ = \int_{0}^{\infty} x \cdot \frac{1}{\pi(1+x^2)}\, dx = \Big[ \frac{1}{2\pi} \ln(1+x^2) \Big]_{0}^{\infty} = +\infty. \]

If $I^-$ is finite but $I^+ = +\infty$, then we say that the expected value is infinite.
A distribution that has an infinite expectation is the Pareto distribution
with parameter $\alpha = 1$ (see Exercise 7.11). The remarks we made on the
integral in the definition of E[X] for continuous X apply similarly to the 
sum in the definition of E[X] for discrete random variables X. 


7.2 Three examples 


The geometric distribution 


If you buy a lottery ticket every week and you have a chance of 1 in 10000 
of winning the jackpot, what is the expected number of weeks you have to 
buy tickets before you get the jackpot? The answer is: 10000 weeks (almost 
two centuries!). The number of weeks is modeled by a random variable with
a geometric distribution with parameter $p = 10^{-4}$.


THE EXPECTATION OF A GEOMETRIC DISTRIBUTION. Let X have
a geometric distribution with parameter $p$; then

\[ \mathrm{E}[X] = \sum_{k=1}^{\infty} k\,p\,(1-p)^{k-1} = \frac{1}{p}. \]

Here $\sum_{k=1}^{\infty} k\,p\,(1-p)^{k-1} = 1/p$ follows from the formula
$\sum_{k=1}^{\infty} k\,x^{k-1} = 1/(1-x)^2$ that has been derived in your calculus course. We will see a simple
(probabilistic) way to obtain the value of this sum in Chapter 11.
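
Spelled out, the substitution of x = 1 - p into the calculus formula runs as follows (valid for 0 < p ≤ 1, so that the series converges):

```latex
\begin{align*}
\mathrm{E}[X] &= \sum_{k=1}^{\infty} k\,p\,(1-p)^{k-1}
   = p \sum_{k=1}^{\infty} k\,x^{k-1} \qquad \text{with } x = 1-p \\
  &= p \cdot \frac{1}{(1-x)^2}
   = \frac{p}{p^2}
   = \frac{1}{p}.
\end{align*}
```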


The exponential distribution 


In Section 5.6 we considered the chemical reactor example, where the residence 
time T, measured in minutes, has an Exp(0.5) distribution. We claimed that
this implies that the mean time a particle stays in the vessel is 2 minutes. 
More generally, we have the following. 


THE EXPECTATION OF AN EXPONENTIAL DISTRIBUTION. Let X
have an exponential distribution with parameter $\lambda$; then

\[ \mathrm{E}[X] = \int_0^{\infty} x\,\lambda e^{-\lambda x}\, dx = \frac{1}{\lambda}. \]


The integral has been determined in your calculus course (with the technique 
of integration by parts). 
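
For reference, the integration by parts might be carried out as follows, differentiating the factor x and integrating the factor λe^{-λx}:

```latex
\begin{align*}
\int_0^{\infty} x\,\lambda e^{-\lambda x}\, dx
  &= \Big[ -x\, e^{-\lambda x} \Big]_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\, dx \\
  &= 0 + \Big[ -\frac{1}{\lambda}\, e^{-\lambda x} \Big]_0^{\infty}
   = \frac{1}{\lambda}.
\end{align*}
```
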
The normal distribution 


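Here, using that the normal density integrates to 1 and applying the
substitution $z = (x - \mu)/\sigma$,

\[
\begin{aligned}
\mathrm{E}[X] &= \int_{-\infty}^{\infty} x\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx
 = \mu + \int_{-\infty}^{\infty} (x-\mu)\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \\
&= \mu + \sigma \int_{-\infty}^{\infty} z\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2} dz = \mu,
\end{aligned}
\]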



where the integral is 0, because the integrand is an odd function. We obtained 
the following rule. 


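THE EXPECTATION OF A NORMAL DISTRIBUTION. Let X be an
$N(\mu, \sigma^2)$ distributed random variable. Then

\[ \mathrm{E}[X] = \int_{-\infty}^{\infty} x\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx = \mu. \]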


7.3 The change-of-variable formula 


Often one does not want to compute the expected value of a random variable
X but rather of a function of X, as, for example, $X^2$. We then need to determine
the distribution of $Y = X^2$, for example by computing the distribution
function $F_Y$ of Y (this is an example of the general problem of how distributions
change under transformations, which is the subject of Chapter 8).
For a concrete example, suppose an architect wants maximal variety in the
sizes of buildings: these should be of the same width and depth X, but X is
uniformly distributed between 0 and 10 meters. What is the distribution of
the area $X^2$ of a building; in particular, will this distribution be (anything
near to) uniform? Let us compute $F_Y$; for $0 \le a \le 100$:

\[ F_Y(a) = P(X^2 \le a) = P(X \le \sqrt{a}) = \frac{\sqrt{a}}{10}. \]

Hence the probability density function $f_Y$ of the area is, for $0 < y < 100$
meters squared, given by

\[ f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy} \frac{\sqrt{y}}{10} = \frac{1}{20\sqrt{y}}. \qquad (7.1) \]

This means that the buildings with small areas are heavily overrepresented,
because $f_Y$ explodes near 0; see also Figure 7.3, in which we plotted $f_Y$.


Surprisingly, this is not very visible in Figure 7.4, an example where we should 
believe our calculations more than our eyes. In the figure the locations of 
the buildings are generated by a Poisson process, the subject of Chapter 12. 
Suppose that a contractor has to make an offer on the price of the foundations 
of the buildings. The amount of concrete he needs will be proportional to the 
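area $X^2$ of a building. So his problem is: what is the expected area of a
building? With $f_Y$ from (7.1) he finds

\[ \mathrm{E}[X^2] = \mathrm{E}[Y] = \int_0^{100} y \cdot \frac{1}{20\sqrt{y}}\, dy = \int_0^{100} \frac{\sqrt{y}}{20}\, dy = \Big[ \frac{1}{30}\, y^{3/2} \Big]_0^{100} = 33\tfrac{1}{3}\ \mathrm{m}^2. \]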


Fig. 7.3. The probability density of the square of a U(0,10) random variable.


It is interesting to note that we really need to do this calculation, because
the expected area is not simply the product of the expected width and the
expected depth, which is 25 m$^2$. However, there is a much easier way in which
the contractor could have obtained this result. He could have argued that
the value of the area is $x^2$ when $x$ is the width, and that he should take the
weighted average of those values, where the weight at width $x$ is given by the
value $f_X(x)$ of the probability density of X. Then he would have computed

\[ \mathrm{E}[X^2] = \int_{-\infty}^{\infty} x^2 f_X(x)\, dx = \int_0^{10} x^2 \cdot \frac{1}{10}\, dx = \Big[ \frac{1}{30}\, x^3 \Big]_0^{10} = 33\tfrac{1}{3}\ \mathrm{m}^2. \]


It is indeed a mathematical theorem that this is always a correct way to 
compute expected values of functions of random variables. 
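
Both routes to the expected area can be compared numerically. The Python sketch below uses a plain midpoint rule (our choice of method; any numerical integrator would do) to approximate the integral over the density of the area and the integral over the density of the width. The names are illustrative.

```python
import math


def midpoint_integral(g, a, b, n=100_000):
    """Approximate the integral of g over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))


# Route 1: E[Y] with the density f_Y(y) = 1 / (20 sqrt(y)) on (0, 100).
via_density_of_area = midpoint_integral(lambda y: y / (20 * math.sqrt(y)), 0, 100)

# Route 2: E[X^2] with the density f_X(x) = 1/10 on (0, 10).
via_change_of_variable = midpoint_integral(lambda x: x * x / 10, 0, 10)

print(via_density_of_area, via_change_of_variable)
```

The midpoints never coincide with y = 0, so the singularity of the density of the area at 0 causes no division by zero; both results should be close to 100/3.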


Fig. 7.4. Top: widths of the buildings between 0 and 10 meters. Bottom: corre- 
sponding buildings in a 100x300 m area. 




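THE CHANGE-OF-VARIABLE FORMULA. Let X be a random variable,
and let $g: \mathbb{R} \to \mathbb{R}$ be a function.
If X is discrete, taking the values $a_1, a_2, \ldots$, then

\[ \mathrm{E}[g(X)] = \sum_i g(a_i)\, P(X = a_i). \]

If X is continuous, with probability density function $f$, then

\[ \mathrm{E}[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\, dx. \]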


QUICK EXERCISE 7.3 Let X have a Ber(p) distribution. Compute $\mathrm{E}[2^X]$.


An operation that occurs very often in practice is a change of units, e.g., from
Fahrenheit to Celsius. What happens then to the expectation? Here we have
to apply the formula with the function $g(x) = rx + s$, where $r$ and $s$ are
real numbers. When X has a continuous distribution, the change-of-variable
formula yields:

\[
\begin{aligned}
\mathrm{E}[rX + s] &= \int_{-\infty}^{\infty} (rx + s) f(x)\, dx \\
&= r \int_{-\infty}^{\infty} x f(x)\, dx + s \int_{-\infty}^{\infty} f(x)\, dx \\
&= r\,\mathrm{E}[X] + s.
\end{aligned}
\]


A similar computation with integrals replaced by sums gives the same result 
for discrete random variables. 


7.4 Variance 


Suppose you are offered an opportunity for an investment whose expected 
return is €500. If you are given the extra information that this expected 
value is the result of a 50% chance of a €450 return and a 50% chance of a 
€ 550 return, then you would not hesitate to spend € 450 on this investment. 
However, if the expected return were the result of a 50% chance of a € 0 return 
and a 50% chance of a € 1000 return, then most people would be reluctant to 
spend such an amount. This demonstrates that the spread (around the mean) 
of a random variable is of great importance. Usually this is measured by the 
expected squared deviation from the mean. 


DEFINITION. The variance Var(X) of a random variable X is the
number

\[ \mathrm{Var}(X) = \mathrm{E}\big[(X - \mathrm{E}[X])^2\big]. \]




Note that the variance of a random variable is always positive (or 0). Fur- 
thermore, there is the question of existence and finiteness (cf. Remark 7.1). 
In practical situations one often considers the standard deviation defined by
$\sqrt{\mathrm{Var}(X)}$, because it has the same dimension as E[X].


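As an example, let us compute the variance of a normal distribution. If X has
an $N(\mu, \sigma^2)$ distribution, then:

\[
\begin{aligned}
\mathrm{Var}(X) = \mathrm{E}\big[(X - \mathrm{E}[X])^2\big]
&= \int_{-\infty}^{\infty} (x-\mu)^2\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \\
&= \sigma^2 \int_{-\infty}^{\infty} z^2\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2} dz.
\end{aligned}
\]

Here we substituted $z = (x - \mu)/\sigma$. Using integration by parts one finds that

\[ \int_{-\infty}^{\infty} z^2\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2} dz = 1. \]

We have found the following property.


VARIANCE OF A NORMAL DISTRIBUTION. Let X be an $N(\mu, \sigma^2)$
distributed random variable. Then

\[ \mathrm{Var}(X) = \int_{-\infty}^{\infty} (x-\mu)^2\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx = \sigma^2. \]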


QUICK EXERCISE 7.4 Let us call the two returns discussed above $Y_1$ and $Y_2$,
respectively. Compute the variance and standard deviation of $Y_1$ and $Y_2$.


It is often not practical to compute Var(X) directly from the definition, but 
one uses the following rule. 


AN ALTERNATIVE EXPRESSION FOR THE VARIANCE. For any random
variable X,

\[ \mathrm{Var}(X) = \mathrm{E}[X^2] - (\mathrm{E}[X])^2. \]


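To see that this rule holds, we apply the change-of-variable formula. Suppose
X is a continuous random variable with probability density function $f$
(the discrete case runs completely analogously). Using the change-of-variable
formula, well-known properties of the integral, and $\int_{-\infty}^{\infty} f(x)\, dx = 1$, we find

\[
\begin{aligned}
\mathrm{Var}(X) &= \mathrm{E}\big[(X - \mathrm{E}[X])^2\big] \\
&= \int_{-\infty}^{\infty} (x - \mathrm{E}[X])^2 f(x)\, dx \\
&= \int_{-\infty}^{\infty} \big(x^2 - 2x\,\mathrm{E}[X] + (\mathrm{E}[X])^2\big) f(x)\, dx \\
&= \int_{-\infty}^{\infty} x^2 f(x)\, dx - 2\,\mathrm{E}[X] \int_{-\infty}^{\infty} x f(x)\, dx + (\mathrm{E}[X])^2 \int_{-\infty}^{\infty} f(x)\, dx \\
&= \mathrm{E}[X^2] - 2(\mathrm{E}[X])^2 + (\mathrm{E}[X])^2 \\
&= \mathrm{E}[X^2] - (\mathrm{E}[X])^2.
\end{aligned}
\]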


With this rule we make two steps: first we compute E[X], then we compute
$\mathrm{E}[X^2]$. The latter is called the second moment of X. Let us compare the
computations, using the definition and this rule for the drill bit example. 
Recall that for this example X takes the values 2,3, and 4 with probabilities 
0.1,0.7, and 0.2. We found that E[X]= 3.1. According to the definition 


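\[
\begin{aligned}
\mathrm{Var}(X) = \mathrm{E}[(X - 3.1)^2] &= 0.1 \cdot (2 - 3.1)^2 + 0.7 \cdot (3 - 3.1)^2 + 0.2 \cdot (4 - 3.1)^2 \\
&= 0.1 \cdot (-1.1)^2 + 0.7 \cdot (-0.1)^2 + 0.2 \cdot (0.9)^2 \\
&= 0.1 \cdot 1.21 + 0.7 \cdot 0.01 + 0.2 \cdot 0.81 \\
&= 0.121 + 0.007 + 0.162 \\
&= 0.29.
\end{aligned}
\]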


Using the rule is neater and somewhat faster: 


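\[
\begin{aligned}
\mathrm{Var}(X) = \mathrm{E}[X^2] - (3.1)^2 &= 0.1 \cdot 2^2 + 0.7 \cdot 3^2 + 0.2 \cdot 4^2 - 9.61 \\
&= 0.1 \cdot 4 + 0.7 \cdot 9 + 0.2 \cdot 16 - 9.61 \\
&= 0.4 + 6.3 + 3.2 - 9.61 \\
&= 0.29.
\end{aligned}
\]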
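
Both computations are easily reproduced by machine. The Python sketch below evaluates the variance of the drill bit lifetime once from the definition and once with the alternative expression; the variable names are ours.

```python
# Distribution of the drill bit lifetime (hours -> probability).
pmf = {2: 0.1, 3: 0.7, 4: 0.2}

mean = sum(a * p for a, p in pmf.items())              # E[X]
second_moment = sum(a**2 * p for a, p in pmf.items())  # E[X^2]

var_by_definition = sum((a - mean) ** 2 * p for a, p in pmf.items())
var_by_rule = second_moment - mean**2

print(mean, var_by_definition, var_by_rule)
```

Up to floating-point rounding, both variances should agree with the value 0.29 found by hand.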


What happens to the variance if we change units? At the end of the pre- 
vious section we showed that E[rX + s] = rE[X]+ s. This can be used to 
obtain the corresponding rule for the variance under change of units (see also 
Exercise 7.15). 


EXPECTATION AND VARIANCE UNDER CHANGE OF UNITS. For any 
random variable X and any real numbers r and s, 


$$\mathrm{E}[rX + s] = r\,\mathrm{E}[X] + s, \quad \text{and} \quad \mathrm{Var}(rX + s) = r^2\,\mathrm{Var}(X).$$


Note that the variance is insensitive to the shift over s. Can you understand 
why this must be true without doing any computations? 
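
A numerical check of the change-of-units rule may be helpful. The sketch below reuses the drill bit distribution; the values r = -2 and s = 5 are arbitrary choices of ours, made only for illustration, and `mean_and_variance` is a hypothetical helper name.

```python
from fractions import Fraction


def mean_and_variance(vals, ps):
    """Return (E[X], Var(X)) for a discrete distribution."""
    m = sum(a * p for a, p in zip(vals, ps))
    v = sum((a - m) ** 2 * p for a, p in zip(vals, ps))
    return m, v


values = [2, 3, 4]
probs = [Fraction(1, 10), Fraction(7, 10), Fraction(2, 10)]

# Arbitrary illustrative change of units Y = rX + s (assumed values).
r, s = Fraction(-2), Fraction(5)

m_x, v_x = mean_and_variance(values, probs)
# Y takes the values r*a + s with the same probabilities as X.
m_y, v_y = mean_and_variance([r * a + s for a in values], probs)

print(m_y == r * m_x + s)    # expectation rule
print(v_y == r ** 2 * v_x)   # variance rule: the shift s plays no role
```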




7.5 Solutions to the quick exercises 


7.1 We have 


E[X] = oaP(X =a) =1- 5 12. 


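7.2 The probability density function f of U is given by f(x) = 0 outside [2, 5]
and f(x) = 1/3 for $2 \le x \le 5$; hence

$$\mathrm{E}[U] = \int_{-\infty}^{\infty} x f(x)\,dx = \int_2^5 \tfrac{1}{3} x\,dx = \left[\tfrac{1}{6} x^2\right]_2^5 = 3\tfrac{1}{2}.$$

7.3 Using the change-of-variable formula we obtain

$$
\begin{aligned}
\mathrm{E}[2^X] &= \sum_i 2^{a_i}\,\mathrm{P}(X = a_i) \\
&= 2^0 \cdot \mathrm{P}(X = 0) + 2^1 \cdot \mathrm{P}(X = 1) \\
&= 1 \cdot (1 - p) + 2 \cdot p = 1 - p + 2p = 1 + p.
\end{aligned}
$$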


You could also have noted that $Y = 2^X$ has a distribution given by P(Y = 1) =
1 - p, P(Y = 2) = p; hence

$$\mathrm{E}[2^X] = \mathrm{E}[Y] = 1 \cdot \mathrm{P}(Y = 1) + 2 \cdot \mathrm{P}(Y = 2) = 1 \cdot (1 - p) + 2 \cdot p = 1 + p.$$


7.4 We have 


$$\mathrm{Var}(Y_1) = \tfrac{1}{2}(450 - 500)^2 + \tfrac{1}{2}(550 - 500)^2 = 50^2 = 2500,$$

so $Y_1$ has standard deviation € 50, and

$$\mathrm{Var}(Y_2) = \tfrac{1}{2}(0 - 500)^2 + \tfrac{1}{2}(1000 - 500)^2 = 500^2 = 250\,000,$$

so $Y_2$ has standard deviation € 500.


7.6 Exercises 


7.1 Let T be the outcome of a roll with a fair die.


a. Describe the probability distribution of T, that is, list the outcomes and
the corresponding probabilities.


b. Determine E[T] and Var(T). 


7.2 The probability distribution of a discrete random variable X is given
by 


PXke=-l)=2, PXx=0)=2, Pxk=l=2. 




a. Compute E[X]. 


b. Give the probability distribution of Y = X? and compute E[Y] using the 
distribution of Y. 


c. Determine $\mathrm{E}[X^2]$ using the change-of-variable formula. Check your
answer against the answer in b.


d. Determine Var(X). 


7.3 For a certain random variable X it is known that E[X] = 2, Var(X) = 3. 
What is E[X?]? 


7.4 Let X be a random variable with E[X] = 2, Var(X) = 4. Compute the 
expectation and variance of 3 — 2X. 


7.5 Determine the expectation and variance of the Ber(p) distribution.


7.6 The random variable Z has probability density function $f(z) = 3z^2/19$
for $2 \le z \le 3$ and f(z) = 0 elsewhere. Determine E[Z]. Before you do the
calculation: will the answer lie closer to 2 than to 3 or the other way around?


7.7 Given is a random variable X with probability density function f given
by f(x) = 0 for x < 0, and for x > 1, and $f(x) = 4x - 4x^3$ for $0 \le x \le 1$.
Determine the expectation and variance of the random variable 2X + 3.


7.8 Given is a continuous random variable X whose distribution function
F satisfies F(x) = 0 for x < 0, F(x) = 1 for x > 1, and $F(x) = x(2 - x)$ for
$0 \le x \le 1$. Determine E[X].


7.9 Let U be a random variable with a $U(\alpha, \beta)$ distribution.


a. Determine the expectation of U. 


b. Determine the variance of U. 


7.10 Let X have an exponential distribution with parameter $\lambda$.


a. Determine E[X] and E[X?] using partial integration. 
b. Determine Var(X). 


7.11 In this exercise we take a look at the mean of a Pareto distribution.


a. Determine the expectation of a Par(2) distribution.
b. Determine the expectation of a Par(4) distribution.
c. Let X have a $Par(\alpha)$ distribution. Show that $\mathrm{E}[X] = \alpha/(\alpha - 1)$ if $\alpha > 1$.


7.12 For which $\alpha$ is the variance of a $Par(\alpha)$ distribution finite? Compute the
variance for these $\alpha$.




7.13 Remember that we found on page 95 that the expected area of a building
was $33\tfrac{1}{3}$ m², whereas the square of the expected width was only 25 m². This
phenomenon is more general: show that for any random variable X one has
$\mathrm{E}[X^2] \ge (\mathrm{E}[X])^2$.

Hint: you might use that $\mathrm{Var}(X) \ge 0$.


7.14 Suppose we choose arbitrarily a point from the square with corners at 
(2,1), (3,1), (2,2), and (3,2). The random variable A is the area of the triangle 
with its corners at (2,1), (3,1), and the chosen point. (See also Exercise 5.9 
and Figure 7.5.) Compute E[A]. 


Fig. 7.5. A triangle in a 1×1 square (corners (2,1), (3,1), (2,2), and (3,2); the third corner of the triangle is the randomly chosen point).


7.15 Let X be a random variable and r and s any real numbers. Use the
change-of-units rule E[rX + s] = rE[X] + s for the expectation to obtain a
and b.


a. Show that $\mathrm{Var}(rX) = r^2\,\mathrm{Var}(X)$.
b. Show that Var(X + s) = Var(X).
c. Combine parts a and b to show that

$$\mathrm{Var}(rX + s) = r^2\,\mathrm{Var}(X).$$


7.16 The probability density function f of the random variable X used
in Figure 7.2 is given by f(x) = 0 outside (0, 1) and $f(x) = -4x \ln(x)$ for
0 < x < 1. Compute the position of the balancing point in the figure, that is,
compute the expectation of X.


7.17 Let U be a discrete random variable taking the values $a_1, \ldots, a_r$ with
probabilities $p_1, \ldots, p_r$.


a. Suppose all $a_i \ge 0$, but that E[U] = 0. Show then

$$a_1 = a_2 = \cdots = a_r = 0.$$

In other words: P(U = 0) = 1.


b. Suppose that V is a random variable taking the values $b_1, \ldots, b_r$ with
probabilities $p_1, \ldots, p_r$. Show that Var(V) = 0 implies

$$\mathrm{P}(V = \mathrm{E}[V]) = 1.$$

Hint: apply a with $U = (V - \mathrm{E}[V])^2$.


8 


Computations with random variables 


There are many ways to make new random variables from old ones. Of course 
this is not a goal in itself; usually new variables are created naturally in 
the process of solving a practical problem. The expectations and variances 
of such new random variables can be calculated with the change-of-variable 
formula. However, often one would like to know the distributions of the new 
random variables. We shall show how to determine these distributions, how 
to compare expectations of random variables and their transformed versions 
(Jensen’s inequality), and how to determine the distributions of maxima and 
minima of several random variables. 


8.1 Transforming discrete random variables 


The problem we consider in this section and the next is how the distribution 
of a random variable X changes if we apply a function g to it, thus obtaining 
a new random variable Y: 

Y = g(X). 


When X is a discrete random variable this is usually not too hard to do: it 
is just a matter of bookkeeping. We illustrate this with an example. Imagine 
an airline company that sells tickets for a flight with 150 available seats. It 
has no idea about how many tickets it will sell. Suppose, to keep the example 
simple, that the number X of tickets that will be sold can be anything from 1 
to 200. Moreover, suppose that each possibility is equally likely,
i.e., P(X = j) = 1/200 for j = 1, 2, ..., 200. The real interest of the airline
company is in the random variable Y, which is the number of passengers that 
have to be refused. What is the distribution of Y? To answer this, note that 
nobody will be refused when the passengers fit in the plane, hence 


$$\mathrm{P}(Y = 0) = \mathrm{P}(X \le 150) = \frac{150}{200} = \frac{3}{4}.$$




For the other values, k = 1, 2, ..., 50,

$$\mathrm{P}(Y = k) = \mathrm{P}(X = 150 + k) = \frac{1}{200}.$$


Note that in this example the function g is given by $g(x) = \max\{x - 150, 0\}$.
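
The bookkeeping for a discrete transformation is easy to mechanize: run through the values of X and add each probability to the entry for g(x). The sketch below does this for the airline example; the names `p_X` and `p_Y` are our own.

```python
from collections import defaultdict
from fractions import Fraction


def g(x):
    """Number of refused passengers when x tickets are sold (150 seats)."""
    return max(x - 150, 0)


# Distribution of X from the text: uniform on 1, 2, ..., 200.
p_X = {j: Fraction(1, 200) for j in range(1, 201)}

# Bookkeeping: P(Y = y) is the sum of P(X = x) over all x with g(x) = y.
p_Y = defaultdict(Fraction)
for x, p in p_X.items():
    p_Y[g(x)] += p

print(p_Y[0], p_Y[1], p_Y[50])
```

Replacing `g` by another function gives the distribution of any other transformation of X in the same way.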


QUICK EXERCISE 8.1 Let Z be the number of passengers that will be in the 
plane. Determine the probability distribution of Z. What is the function g in 
this case? 


8.2 Transforming continuous random variables 


We now turn to continuous random variables. Since single values occur with 
probability zero for a continuous random variable, the approach above does 
not work. The strategy now is to first determine the distribution function of 
the transformed random variable Y = g(X) and then the probability density 
by differentiating. We shall illustrate this with the following example (actually 
we saw an example of such a computation in Section 7.3 with the function 
$g(x) = x^2$).

We consider two methods that traffic police employ to determine whether 
you deserve a fine for speeding. From experience, the traffic police think that 
vehicles are driving at speeds ranging from 60 to 90 km/hour at a certain 
road section where the speed limit is 80 km/hour. They assume that the 
speed of the cars is uniformly distributed over this interval. The first method 
is measuring the speed at a fixed spot in the road section. With this method 
the police will find that about (90 — 80)/(90 — 60) = 1/3 of the cars will be 
fined. 


For the second method, cameras are put at the beginning and end of a 1-km 
road section, and a driver is fined if he spends less than a certain amount of 
time in the road section. Cars driving at 60 km/hour need one minute, those 
driving at 90 km/hour only 40 seconds. Let us therefore model the time T 
an arbitrary car spends in the section by a uniform distribution over (40,60) 
seconds. What is the speed V we deduce from this travelling time? Note that 
for 40 < t < 60, 


$$\mathrm{P}(T \le t) = \frac{t - 40}{20}.$$


Since there are 3600 seconds in an hour we have that 


$$V = g(T) = \frac{3600}{T}.$$


We therefore find for the distribution function $F_V(v) = \mathrm{P}(V \le v)$ of the
speed V that

$$F_V(v) = \mathrm{P}\left(\frac{3600}{T} \le v\right) = \mathrm{P}\left(T \ge \frac{3600}{v}\right) = 1 - \frac{(3600/v) - 40}{20} = 3 - \frac{180}{v}$$


for all speeds v between 60 and 90. We can now obtain the probability density
$f_V$ of V by differentiating:

$$f_V(v) = \frac{d}{dv} F_V(v) = \frac{d}{dv}\left(3 - \frac{180}{v}\right) = \frac{180}{v^2}$$

for $60 \le v \le 90$.


It is amusing to note that with the second model the traffic police write fewer 
speeding tickets because 


$$\mathrm{P}(V > 80) = 1 - \mathrm{P}(V \le 80) = 1 - \left(3 - \frac{180}{80}\right) = \frac{1}{4}.$$


(With the first model we found probability 1/3 that a car drove faster than 
80 km/hour.) This is related to a famous result in road traffic research, which 
is succinctly phrased as: “space mean speed < time mean speed” (see [37]). 
It is also related to Jensen’s inequality, which we introduce in Section 8.3. 
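
The comparison of the two speed-measuring methods can be reproduced with exact arithmetic. This is a small sketch under the two uniform models of the text; the function names `F_T` and `F_V` are ours.

```python
from fractions import Fraction


def F_T(t):
    """Distribution function of the travel time T ~ U(40, 60), in seconds."""
    if t <= 40:
        return Fraction(0)
    if t >= 60:
        return Fraction(1)
    return Fraction(t - 40, 20)


def F_V(v):
    """P(V <= v) for V = 3600/T: the event V <= v equals T >= 3600/v."""
    return 1 - F_T(Fraction(3600) / v)


# Method 1: the speed itself is uniform on (60, 90).
fined_method_1 = Fraction(90 - 80, 90 - 60)
# Method 2: the travel time is uniform on (40, 60).
fined_method_2 = 1 - F_V(80)

print(fined_method_1, fined_method_2)
```

The same `F_V` can be evaluated at any speed between 60 and 90 to compare with the closed form 3 - 180/v.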


Similar to the way this is done in the traffic example, one can determine
the distribution of Y = 1/X for any X with a continuous distribution. The
outcome will be that if X has density $f_X$, then the density $f_Y$ of Y is given
by

$$f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{1}{y^2}\, f_X\!\left(\frac{1}{y}\right) \quad \text{for } y < 0 \text{ and } y > 0.$$

One can give $f_Y(0)$ any value; often one puts $f_Y(0) = 0$.


QUICK EXERCISE 8.2 Let X have a continuous distribution with probability
density $f_X(x) = 1/[\pi(1 + x^2)]$. What is the distribution of Y = 1/X?


We turn to a second example. A very common transformation is a change of
units, for instance, from Celsius to Fahrenheit. If X is temperature expressed
in degrees Celsius, then $Y = \frac{9}{5}X + 32$ is the temperature in degrees Fahrenheit.
Let $F_X$ and $F_Y$ be the distribution functions of X and Y. Then we have for
any a

$$F_Y(a) = \mathrm{P}(Y \le a) = \mathrm{P}\left(\tfrac{9}{5}X + 32 \le a\right) = \mathrm{P}\left(X \le \tfrac{5}{9}(a - 32)\right) = F_X\left(\tfrac{5}{9}(a - 32)\right).$$

By differentiating $F_Y$ (using the chain rule), we obtain the probability density
$f_Y(y) = \frac{5}{9} f_X\left(\frac{5}{9}(y - 32)\right)$. We can do this for more general changes of units,
and we obtain the following useful rule.




CHANGE-OF-UNITS TRANSFORMATION. Let X be a continuous random
variable with distribution function $F_X$ and probability density
function $f_X$. If we change units to Y = rX + s for real numbers r > 0
and s, then

$$F_Y(y) = F_X\!\left(\frac{y - s}{r}\right) \quad \text{and} \quad f_Y(y) = \frac{1}{r}\, f_X\!\left(\frac{y - s}{r}\right).$$


As an example, let X be a random variable with an $N(\mu, \sigma^2)$ distribution,
and let Y = rX + s. Then this rule gives us

$$f_Y(y) = \frac{1}{r}\, f_X\!\left(\frac{y - s}{r}\right) = \frac{1}{r\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left((y - r\mu - s)/(r\sigma)\right)^2}$$

for $-\infty < y < \infty$. On the right-hand side we recognize the probability density
of a normal distribution with parameters $r\mu + s$ and $r^2\sigma^2$. This illustrates the
following rule.


NORMAL RANDOM VARIABLES UNDER CHANGE OF UNITS. Let X
be a random variable with an $N(\mu, \sigma^2)$ distribution. For any $r \ne 0$
and any s, the random variable rX + s has an $N(r\mu + s, r^2\sigma^2)$
distribution.


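Note that if X has an $N(\mu, \sigma^2)$ distribution, then with $r = 1/\sigma$ and $s = -\mu/\sigma$
we conclude that

$$Z = \frac{1}{\sigma} X + \left(-\frac{\mu}{\sigma}\right) = \frac{X - \mu}{\sigma}$$

has an N(0, 1) distribution. As a consequence

$$F_X(a) = \mathrm{P}(X \le a) = \mathrm{P}(\sigma Z + \mu \le a) = \mathrm{P}\left(Z \le \frac{a - \mu}{\sigma}\right) = \Phi\left(\frac{a - \mu}{\sigma}\right).$$

So any probability for an $N(\mu, \sigma^2)$ distributed random variable X can be
expressed in terms of an N(0, 1) distributed random variable Z.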
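
Instead of a table of the standard normal distribution one can evaluate Φ numerically. The sketch below uses the identity Φ(z) = (1 + erf(z/√2))/2 with the error function from Python's standard library; the parameters μ = 10 and σ = 3 are hypothetical values chosen only for illustration.

```python
import math


def Phi(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_cdf(a, mu, sigma):
    """P(X <= a) for X ~ N(mu, sigma^2), by standardizing to Z = (X - mu)/sigma."""
    return Phi((a - mu) / sigma)


# Hypothetical example (not from the text): X ~ N(10, 9), so sigma = 3.
mu, sigma = 10.0, 3.0
p = normal_cdf(13.0, mu, sigma)   # equals Phi(1)
print(round(p, 4))
```

Note that the second parameter of N(μ, σ²) is the variance, so the standard deviation σ has to be passed to `normal_cdf`.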


QUICK EXERCISE 8.3 Compute the probabilities $\mathrm{P}(X \le 5)$ and $\mathrm{P}(X \ge 2)$ for
X with an N(4, 25) distribution.


8.3 Jensen’s inequality 


Without actually computing the distribution of g(X) we can often tell how 
E[g(X)] relates to g(E[X]). For the change-of-units transformation g(x) =
rx + s we know that E[g(X)] = g(E[X]) (see Section 7.3). It is a common




error to equate these two sides for other functions g. In fact, equality will very 
rarely occur for nonlinear g. 


For example, suppose that a company that produces microelectronic parts 
has a target production of 240 chips per day, but the yield has only been 40, 
60, and 80 chips on three consecutive days. The average production over the 
three days then is 60 chips, so on average the production should have been 
4 times higher to reach the target. However, one can also look at this in the 
following way: on the three days the production should have been 240/40 = 6, 
240/60 = 4, and 240/80 = 3 times higher. On average that is 


$$\tfrac{1}{3}(6 + 4 + 3) = \tfrac{13}{3} = 4.3333$$


times higher! What happens here can be explained (take for X the part of the
target production that is realized, where you give equal probabilities to the
three outcomes 1/6, 1/4, and 1/3) by the fact that if X is a random variable
taking positive values, then always

$$\frac{1}{\mathrm{E}[X]} < \mathrm{E}\left[\frac{1}{X}\right],$$

unless Var(X) = 0, which only happens if X is not random at all (cf. Exercise 7.17).
This inequality is the case g(x) = 1/x on $(0, \infty)$ of the following
result that holds for general convex functions g.


JENSEN’S INEQUALITY. Let g be a convex function, and let X be 
a random variable. Then 


$$g(\mathrm{E}[X]) \le \mathrm{E}[g(X)].$$


Recall from calculus that a twice differentiable function g is convex on an
interval I if $g''(x) \ge 0$ for all x in I, and strictly convex if $g''(x) > 0$ for
all x in I. When X takes its values in an interval I (this can, for instance,
be $I = (-\infty, \infty)$), and g is strictly convex on I, then strict inequality holds:
g(E[X]) < E[g(X)], unless X is not random.
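
The chip production example is a direct instance of Jensen's inequality with g(x) = 1/x, and it can be verified in a few lines. In this sketch the variable names are ours; X is the realized fraction of the target, with equal probabilities on the three days.

```python
from fractions import Fraction

target = 240
yields = [40, 60, 80]

# X = realized part of the target: 1/6, 1/4, 1/3, each with probability 1/3.
xs = [Fraction(y, target) for y in yields]

E_X = sum(xs) / len(xs)                  # E[X]
g_of_mean = 1 / E_X                      # g(E[X]) with g(x) = 1/x
E_g = sum(1 / x for x in xs) / len(xs)   # E[g(X)] = E[1/X]

print(g_of_mean, E_g, g_of_mean < E_g)
```

Since g(x) = 1/x is strictly convex on (0, ∞) and X is genuinely random here, the inequality is strict.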


In Figure 8.1 we illustrate the way in which this result can be obtained for 
the special case of a random variable X that takes two values, a and b. In the 
figure, X takes these two values with probability 3/4 and 1/4 respectively. 
Convexity of g forces any line segment connecting two points on the graph of 
g to lie above the part of the graph between these two points. So if we choose 
the line segment from (a, g(a)) to (b, g(b)), then it follows that the point 


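$$(\mathrm{E}[X], \mathrm{E}[g(X)]) = \left(\tfrac{3}{4}a + \tfrac{1}{4}b,\ \tfrac{3}{4}g(a) + \tfrac{1}{4}g(b)\right) = \tfrac{3}{4}(a, g(a)) + \tfrac{1}{4}(b, g(b))$$

on this line lies “above” the point (E[X], g(E[X])) on the graph of g. Hence
$\mathrm{E}[g(X)] \ge g(\mathrm{E}[X])$.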


Fig. 8.1. Jensen’s inequality.


A simple example is given by $g(x) = x^2$. This function is convex ($g''(x) = 2$
for all x), and hence

$$(\mathrm{E}[X])^2 \le \mathrm{E}[X^2].$$

Note that this is exactly the same as saying that $\mathrm{Var}(X) \ge 0$, which we have
already seen in Section 7.4.


QUICK EXERCISE 8.4 Let X be a random variable with Var(X) > 0. Which
is true: $\mathrm{E}[e^{-X}] < e^{-\mathrm{E}[X]}$ or $\mathrm{E}[e^{-X}] > e^{-\mathrm{E}[X]}$?


8.4 Extremes 


In many situations the maximum (or minimum) of a sequence $X_1, X_2, \ldots, X_n$
of random variables is the variable of interest. For instance, let $X_1, X_2,
\ldots, X_{365}$ be the water level of a river during the days of a particular year
for a particular location. Suppose there will be flooding if the level exceeds a
certain height, usually the height of the dykes. The question whether flooding
occurs during a year is completely answered by looking at the maximum
of $X_1, X_2, \ldots, X_{365}$. If one wants to predict occurrence of flooding in the
future, the probability distribution of this maximum is of great interest. Similar
models arise, for instance, when one is interested in possible damage from a
series of shocks or in the extent of a contamination plume in the subsurface.


We want to find the distribution of the random variable

$$Z = \max\{X_1, X_2, \ldots, X_n\}.$$


We can determine the distribution function of Z by realizing that the maximum
of the $X_i$ is smaller than a number a if and only if all $X_i$ are smaller
than a:

$$F_Z(a) = \mathrm{P}(Z \le a) = \mathrm{P}(\max\{X_1, \ldots, X_n\} \le a) = \mathrm{P}(X_1 \le a, \ldots, X_n \le a).$$


Now suppose that the events $\{X_i \le a_i\}$ are independent for every choice
of the $a_i$. In this case we call the random variables independent (see also
Chapter 9, where we study independence of random variables). In particular,
the events $\{X_i \le a\}$ are independent for all a. It then follows that

$$F_Z(a) = \mathrm{P}(X_1 \le a, \ldots, X_n \le a) = \mathrm{P}(X_1 \le a) \cdots \mathrm{P}(X_n \le a).$$


Hence, if all random variables have the same distribution function F, then
the following result holds.


THE DISTRIBUTION OF THE MAXIMUM. Let $X_1, X_2, \ldots, X_n$ be n
independent random variables with the same distribution function
F, and let $Z = \max\{X_1, X_2, \ldots, X_n\}$. Then

$$F_Z(a) = (F(a))^n.$$


QUICK EXERCISE 8.5 Let $X_1, X_2, \ldots, X_n$ be independent random variables,
all with a U(0, 1) distribution. Let $Z = \max\{X_1, \ldots, X_n\}$. Compute the
distribution function and the probability density function of Z.


What can we say about the distribution of the minimum? Let

$$V = \min\{X_1, X_2, \ldots, X_n\}.$$


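We can now find the distribution function $F_V$ of V by observing that the
minimum of the $X_i$ is larger than a number a if and only if all $X_i$ are larger
than a. The trick is to switch to the complement of the event $\{V \le a\}$:

$$
\begin{aligned}
F_V(a) = \mathrm{P}(V \le a) &= 1 - \mathrm{P}(V > a) = 1 - \mathrm{P}(\min\{X_1, \ldots, X_n\} > a) \\
&= 1 - \mathrm{P}(X_1 > a, \ldots, X_n > a).
\end{aligned}
$$

So using independence and switching back again, we obtain

$$
\begin{aligned}
F_V(a) &= 1 - \mathrm{P}(X_1 > a, \ldots, X_n > a) = 1 - \mathrm{P}(X_1 > a) \cdots \mathrm{P}(X_n > a) \\
&= 1 - (1 - \mathrm{P}(X_1 \le a)) \cdots (1 - \mathrm{P}(X_n \le a)).
\end{aligned}
$$

We have found the following result for the minimum.
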
THE DISTRIBUTION OF THE MINIMUM. Let $X_1, X_2, \ldots, X_n$ be n
independent random variables with the same distribution function
F, and let $V = \min\{X_1, X_2, \ldots, X_n\}$. Then

$$F_V(a) = 1 - (1 - F(a))^n.$$


QUICK EXERCISE 8.6 Let $X_1, X_2, \ldots, X_n$ be independent random variables,
all with a U(0, 1) distribution. Let $V = \min\{X_1, \ldots, X_n\}$. Compute the
distribution function and the probability density function of V.
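
The derivation of the two rules only uses independence and a common distribution function F, so they hold for discrete random variables as well. That makes an exact check by enumeration possible. The sketch below takes n = 3 independent rolls of a fair die, an example of our own choosing that is not in the text.

```python
from fractions import Fraction
from itertools import product

n = 3                    # number of independent rolls (illustrative choice)
faces = range(1, 7)
outcomes = list(product(faces, repeat=n))   # all 6^n equally likely outcomes


def F(a):
    """Distribution function of a single fair die roll."""
    return Fraction(sum(1 for x in faces if x <= a), 6)


def F_max(a):
    """P(max <= a), computed by counting outcomes."""
    return Fraction(sum(1 for w in outcomes if max(w) <= a), len(outcomes))


def F_min(a):
    """P(min <= a), computed by counting outcomes."""
    return Fraction(sum(1 for w in outcomes if min(w) <= a), len(outcomes))


# Compare the counts with the formulas F(a)^n and 1 - (1 - F(a))^n.
formulas_hold = all(
    F_max(a) == F(a) ** n and F_min(a) == 1 - (1 - F(a)) ** n
    for a in faces
)
print(formulas_hold)
```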




8.5 Solutions to the quick exercises 


8.1 Clearly Z can take the values 1, ..., 150. The value 150 is special:
the plane is full if 150 or more people buy a ticket. Hence P(Z = 150) =
$\mathrm{P}(X \ge 150)$ = 51/200. For the other values we have P(Z = i) = P(X = i) =
1/200, for i = 1, ..., 149. Clearly, here g(x) = min{150, x}.


8.2 The probability density of Y = 1/X is

$$f_Y(y) = \frac{1}{y^2} \cdot \frac{1}{\pi\left(1 + (1/y)^2\right)} = \frac{1}{\pi(1 + y^2)}.$$

We see that 1/X has the same distribution as X! (This distribution is called
the standard Cauchy distribution; it will be introduced in Chapter 11.)


8.3 First define Z = (X —4)/5, which has an N(0, 1) distribution. Then from 
Table B.1 


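$$\mathrm{P}(X \le 5) = \mathrm{P}\left(Z \le \frac{5 - 4}{5}\right) = \mathrm{P}(Z \le 0.20) = 1 - 0.4207 = 0.5793.$$

Similarly, using the symmetry of the normal distribution,

$$\mathrm{P}(X \ge 2) = \mathrm{P}\left(Z \ge \frac{2 - 4}{5}\right) = \mathrm{P}(Z \ge -0.40) = \mathrm{P}(Z \le 0.40) = 0.6554.$$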


8.4 If $g(x) = e^{-x}$, then $g''(x) = e^{-x} > 0$; hence g is strictly convex. It follows
from Jensen’s inequality that

$$e^{-\mathrm{E}[X]} \le \mathrm{E}[e^{-X}].$$

Moreover, if Var(X) > 0, then the inequality is strict.


8.5 The distribution function of the $X_i$ is given by F(x) = x on [0, 1]. Therefore
the distribution function $F_Z$ of the maximum Z is equal to $F_Z(a) =
(F(a))^n = a^n$. Its probability density function is

$$f_Z(z) = \frac{d}{dz} F_Z(z) = n z^{n-1} \quad \text{for } 0 \le z \le 1.$$


8.6 The distribution function of the $X_i$ is given by F(x) = x on [0, 1]. Therefore
the distribution function $F_V$ of the minimum V is equal to $F_V(a) =
1 - (1 - a)^n$. Its probability density function is

$$f_V(v) = \frac{d}{dv} F_V(v) = n(1 - v)^{n-1} \quad \text{for } 0 \le v \le 1.$$




8.6 Exercises 


8.1 Often one is interested in the distribution of the deviation of a random
variable X from its mean $\mu = \mathrm{E}[X]$. Let X take the values 80, 90, 100, 110,
and 120, all with probability 0.2; then $\mathrm{E}[X] = \mu = 100$. Determine the
distribution of $Y = |X - \mu|$. That is, specify the values Y can take and give the
corresponding probabilities.


8.2 Suppose X has a uniform distribution over the points {1, 2, 3, 4, 5, 6}
and that $g(x) = \sin(\frac{\pi}{2}x)$.


a. Determine the distribution of $Y = g(X) = \sin(\frac{\pi}{2}X)$, that is, specify the
values Y can take and give the corresponding probabilities.

b. Let $Z = \cos(\frac{\pi}{2}X)$. Determine the distribution of Z.

c. Determine the distribution of $W = Y^2 + Z^2$. Warning: in this example
there is a very special dependency between Y and Z, and in general it is
much harder to determine the distribution of a random variable that is a
function of two other random variables. This is the subject of Chapter 11.


8.3 The continuous random variable U is uniformly distributed over [0, 1].


a. Determine the distribution function of V = 2U + 7. What kind of
distribution does V have?

b. Determine the distribution function of V = rU + s for all real numbers
r > 0 and s. See Exercise 8.9 for what happens for negative r.


8.4 Transforming exponential distributions.


a. Let X have an $Exp(\frac{1}{2})$ distribution. Determine the distribution function
of $\frac{1}{2}X$. What kind of distribution does $\frac{1}{2}X$ have?

b. Let X have an $Exp(\lambda)$ distribution. Determine the distribution function
of $\lambda X$. What kind of distribution does $\lambda X$ have?


8.5 Let X be a continuous random variable with probability density function

$$f_X(x) = \begin{cases} \frac{3}{4}\,x(2 - x) & \text{for } 0 \le x \le 2 \\ 0 & \text{elsewhere.} \end{cases}$$

a. Determine the distribution function $F_X$.
b. Let $Y = \sqrt{X}$. Determine the distribution function $F_Y$.
c. Determine the probability density of Y.


8.6 Let X be a continuous random variable with probability density $f_X$ that
takes only positive values and let Y = 1/X.


a. Determine $F_Y(y)$ and show that

$$f_Y(y) = \frac{1}{y^2}\, f_X\!\left(\frac{1}{y}\right) \quad \text{for } y > 0.$$

b. Let Z = 1/Y. Using a, determine the probability density $f_Z$ of Z, in terms
of $f_X$.


8.7 Let X have a $Par(\alpha)$ distribution. Determine the distribution function of
ln X. What kind of a distribution does ln X have?


8.8 Let X have an Exp(1) distribution, and let $\alpha$ and $\lambda$ be positive numbers.
Determine the distribution function of the random variable

$$W = \frac{X^{1/\alpha}}{\lambda}.$$

The distribution of the random variable W is called the Weibull distribution
with parameters $\alpha$ and $\lambda$.


8.9 Let X be a continuous random variable. Express the distribution function 
and probability density of the random variable Y = —X in terms of those of X. 


8.10 Let X be an N(3, 4) distributed random variable. Use the rule for
normal random variables under change of units and Table B.1 to determine
the probabilities $\mathrm{P}(X \ge 3)$ and $\mathrm{P}(X \le 1)$.


8.11 Let X be a random variable, and let g be a twice differentiable function
with $g''(x) \le 0$ for all x. Such a function is called a concave function. Show
that for concave functions always

$$g(\mathrm{E}[X]) \ge \mathrm{E}[g(X)].$$


8.12 Let X be a random variable with the following probability mass
function:


x 0 1 100 10000 
P(X=2) $3 4 3 


a. Determine the distribution of $Y = \sqrt{X}$.

b. Which is larger, $\mathrm{E}[\sqrt{X}]$ or $\sqrt{\mathrm{E}[X]}$?
Hint: use Exercise 8.11, or start by showing that the function $g(x) = -\sqrt{x}$
is convex.


c. Compute $\sqrt{\mathrm{E}[X]}$ and $\mathrm{E}[\sqrt{X}]$ to check your answer (and to see that it
makes a big difference!).


8.13 Let W have a $U(\pi, 2\pi)$ distribution. What is larger: E[sin(W)] or
sin(E[W])? Check your answer by computing these two numbers.




8.14 In this exercise we take a look at Jensen’s inequality for the function
$g(x) = x^3$ (which is neither convex nor concave on $(-\infty, \infty)$).


a. Can you find a (discrete) random variable X with Var(X) > 0 such that

$$\mathrm{E}[X^3] = (\mathrm{E}[X])^3?$$

b. Under what kind of conditions on a random variable X will the inequality
$\mathrm{E}[X^3] \ge (\mathrm{E}[X])^3$ certainly hold?


8.15 Let $X_1, X_2, \ldots, X_n$ be independent random variables, all with a U(0, 1)
distribution. Let $Z = \max\{X_1, \ldots, X_n\}$ and $V = \min\{X_1, \ldots, X_n\}$.


a. Compute $\mathrm{E}[\max\{X_1, X_2\}]$ and $\mathrm{E}[\min\{X_1, X_2\}]$.
b. Compute E[Z] and E[V] for general n.


c. Can you argue directly (using the symmetry of the uniform distribution
(see Exercise 6.3) and not the result of the computation in b) that
$1 - \mathrm{E}[\max\{X_1, \ldots, X_n\}] = \mathrm{E}[\min\{X_1, \ldots, X_n\}]$?


8.16 In this exercise we derive a kind of Jensen inequality for the minimum. 


a. Let a and b be real numbers. Show that

$$\min\{a, b\} = \tfrac{1}{2}(a + b - |a - b|).$$

b. Let X and Y be independent random variables with the same distribution
and finite expectation. Deduce from a that

$$\mathrm{E}[\min\{X, Y\}] = \mathrm{E}[X] - \tfrac{1}{2}\,\mathrm{E}[|X - Y|].$$

c. Show that

$$\mathrm{E}[\min\{X, Y\}] \le \min\{\mathrm{E}[X], \mathrm{E}[Y]\}.$$

Remark: this is not so interesting, since min{E[X], E[Y]} = E[X] = E[Y],
but we will see in the exercises of Chapter 11 that this inequality is also true
for X and Y that do not have the same distribution.


8.17 Let $X_1, \ldots, X_n$ be n independent random variables with the same
distribution function F.


a. Convince yourself that for any numbers $x_1, \ldots, x_n$ it is true that

$$\min\{x_1, \ldots, x_n\} = -\max\{-x_1, \ldots, -x_n\}.$$

b. Let $Z = \max\{X_1, X_2, \ldots, X_n\}$ and $V = \min\{X_1, X_2, \ldots, X_n\}$. Use
Exercise 8.9 and the observation in a to deduce the formula

$$F_V(a) = 1 - (1 - F(a))^n$$

directly from the formula

$$F_Z(a) = (F(a))^n.$$


8.18 Let $X_1, X_2, \ldots, X_n$ be independent random variables, all with an
$Exp(\lambda)$ distribution. Let $V = \min\{X_1, \ldots, X_n\}$. Determine the distribution
function of V. What kind of distribution is this?


8.19 From the “north pole” N of a circle with diameter 1, a point Q on
the circle is mapped to a point t on the line by its projection from N, as
illustrated in Figure 8.2.


Fig. 8.2. Mapping the circle to the line.


Suppose that the point Q is uniformly chosen on the circle. This is the same
as saying that the angle $\varphi$ is uniformly chosen from the interval $[-\frac{\pi}{2}, \frac{\pi}{2}]$ (can
you see this?). Let X be this angle, so that X is uniformly distributed over
the interval $[-\frac{\pi}{2}, \frac{\pi}{2}]$. This means that $\mathrm{P}(X \le \varphi) = 1/2 + \varphi/\pi$ (cf. Quick
exercise 5.3). What will be the distribution of the projection of Q on the line?
Let us call this random variable Z. Then it is clear that the event $\{Z \le t\}$ is
equal to the event $\{X \le \varphi\}$, where t and $\varphi$ correspond to each other under
the projection. This means that $\tan(\varphi) = t$, which is the same as saying that

$$\arctan(t) = \varphi.$$


a. What part of the circle is mapped to the interval $[1, \infty)$?


b. Compute the distribution function of Z using the correspondence between
t and $\varphi$.


c. Compute the probability density function of Z. 


The distribution of Z is called the Cauchy distribution (which will be discussed 
in Chapter 11). 


9 


Joint distributions and independence 


Random variables related to the same experiment often influence one another. 
In order to capture this, we introduce the joint distribution of two or more 
random variables. We also discuss the notion of independence for random 
variables, which models the situation where random variables do not influence 
each other. As with single random variables we treat these topics for discrete 
and continuous random variables separately. 


9.1 Joint distributions of discrete random variables 


In a census one is usually interested in several variables, such as income, age,
and gender. In themselves these variables are interesting, but when two (or more) are
studied simultaneously, detailed information is obtained on the society where
the census is performed. For instance, studying income, age, and gender jointly
might give insight into the emancipation of women.


Without mentioning it explicitly, we already encountered several examples of 
joint distributions of discrete random variables. For example, in Chapter 4 we 
defined two random variables S and M, the sum and the maximum of two 
independent throws of a die. 


QUICK EXERCISE 9.1 List the elements of the event {S = 7,M = 4} and 
compute its probability. 


In general, the joint distribution of two discrete random variables X and Y, 
defined on the same sample space $\Omega$, is given by prescribing the probabilities
of all possible values of the pair (X,Y). 




DEFINITION. The joint probability mass function p of two discrete
random variables X and Y is the function $p : \mathbb{R}^2 \to [0, 1]$, defined by

$$p(a, b) = \mathrm{P}(X = a, Y = b) \quad \text{for } -\infty < a, b < \infty.$$


To stress the dependence on (X, Y), we sometimes write $p_{X,Y}$ instead of p.
If X and Y take on the values $a_1, a_2, \ldots, a_k$ and $b_1, b_2, \ldots, b_\ell$, respectively,
the joint distribution of X and Y can simply be described by listing all the
possible values of $p(a_i, b_j)$. For example, for the random variables S and M
from Chapter 4 we obtain Table 9.1.


Table 9.1. Joint probability mass function p(a, b) = P(S = a, M = b).


                            b
  a      1      2      3      4      5      6
  2    1/36     0      0      0      0      0
  3      0    2/36     0      0      0      0
  4      0    1/36   2/36     0      0      0
  5      0      0    2/36   2/36     0      0
  6      0      0    1/36   2/36   2/36     0
  7      0      0      0    2/36   2/36   2/36
  8      0      0      0    1/36   2/36   2/36
  9      0      0      0      0    2/36   2/36
 10      0      0      0      0    1/36   2/36
 11      0      0      0      0      0    2/36
 12      0      0      0      0      0    1/36


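From this table we can retrieve the distribution of S and of M. For example,
because

$$\{S = 6\} = \{S = 6, M = 1\} \cup \{S = 6, M = 2\} \cup \cdots \cup \{S = 6, M = 6\},$$

and because the six events

$$\{S = 6, M = 1\}, \{S = 6, M = 2\}, \ldots, \{S = 6, M = 6\}$$

are mutually exclusive, we find that

$$
\begin{aligned}
p_S(6) = \mathrm{P}(S = 6) &= \mathrm{P}(S = 6, M = 1) + \cdots + \mathrm{P}(S = 6, M = 6) \\
&= 0 + 0 + \frac{1}{36} + \frac{2}{36} + \frac{2}{36} + 0 = \frac{5}{36}.
\end{aligned}
$$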




Table 9.2. Joint distribution and marginal distributions of S and M.


                              b
   a       1      2      3      4      5      6     p_S(a)
   2     1/36     0      0      0      0      0      1/36
   3       0    2/36     0      0      0      0      2/36
   4       0    1/36   2/36     0      0      0      3/36
   5       0      0    2/36   2/36     0      0      4/36
   6       0      0    1/36   2/36   2/36     0      5/36
   7       0      0      0    2/36   2/36   2/36     6/36
   8       0      0      0    1/36   2/36   2/36     5/36
   9       0      0      0      0    2/36   2/36     4/36
  10       0      0      0      0    1/36   2/36     3/36
  11       0      0      0      0      0    2/36     2/36
  12       0      0      0      0      0    1/36     1/36
 p_M(b)  1/36   3/36   5/36   7/36   9/36  11/36      1


Thus we see that the probabilities of S can be obtained by taking the sum 
of the joint probabilities in the rows of Table 9.1. This yields the probability 
distribution of S, i.e., all values of ps(a) for a = 2,...,12. We speak of the 
marginal distribution of S. In Table 9.2 we have added this distribution in the
right “margin” of the table. Similarly, summing over the columns of Table 9.1 
yields the marginal distribution of M, in the bottom margin of Table 9.2. 
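
The joint probability mass function of S and M and both marginals can be generated by enumerating the 36 equally likely outcomes of two throws. The following sketch does so; the dictionary names `joint`, `p_S`, and `p_M` are our own.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Joint probability mass function of S (sum) and M (maximum) of two throws.
joint = defaultdict(Fraction)
for d1, d2 in product(range(1, 7), repeat=2):
    joint[(d1 + d2, max(d1, d2))] += Fraction(1, 36)

# Marginals: sum the joint probabilities over the other variable.
p_S = defaultdict(Fraction)
p_M = defaultdict(Fraction)
for (a, b), p in joint.items():
    p_S[a] += p   # row sums of the table
    p_M[b] += p   # column sums of the table

print(p_S[6], p_M[4])
```

Printing `p_S` and `p_M` in full reproduces the right and bottom margins of Table 9.2 (as reduced fractions).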
The joint distribution of two random variables contains a lot more information 
than the two marginal distributions. This can be illustrated by the fact that in 
many cases the joint probability mass function of X and Y cannot be retrieved 
from the marginal probability mass functions px and py. A simple example 
is given in the following quick exercise. 


QUICK EXERCISE 9.2 Let X and Y be two discrete random variables, with
joint probability mass function p, given by the following table, where $\varepsilon$ is an
arbitrary number between -1/4 and 1/4.


                     b
   a          0            1         p_X(a)
   0       1/4 - ε      1/4 + ε       ...
   1       1/4 + ε      1/4 - ε       ...
 p_Y(b)      ...          ...         ...


Complete the table, and conclude that we cannot retrieve p from $p_X$ and $p_Y$.




The joint distribution function 


As in the case of a single random variable, the distribution function enables 
us to treat pairs of discrete and pairs of continuous random variables in the 
same way. 


DEFINITION. The joint distribution function F of two random variables
X and Y is the function $F : \mathbb{R}^2 \to [0, 1]$ defined by

$$F(a, b) = \mathrm{P}(X \le a, Y \le b) \quad \text{for } -\infty < a, b < \infty.$$


QUICK EXERCISE 9.3 Compute F(5, 3) for the joint distribution function F
of the pair (S,M). 


The distribution functions $F_X$ and $F_Y$ can be obtained from the joint
distribution function of X and Y. As before, we speak of the marginal distribution
functions. The following rule holds.


FROM JOINT TO MARGINAL DISTRIBUTION FUNCTION. Let F be
the joint distribution function of random variables X and Y. Then
the marginal distribution function of X is given for each a by

$$F_X(a) = \mathrm{P}(X \le a) = F(a, +\infty) = \lim_{b \to \infty} F(a, b), \qquad (9.1)$$

and the marginal distribution function of Y is given for each b by

$$F_Y(b) = \mathrm{P}(Y \le b) = F(+\infty, b) = \lim_{a \to \infty} F(a, b). \qquad (9.2)$$


9.2 Joint distributions of continuous random variables 


We saw in Chapter 5 that the probability that a single continuous random
variable X lies in an interval [a, b] is equal to the area under the probability
density function f of X over the interval (see also Figure 5.1). For the joint
distribution of continuous random variables X and Y the situation is analogous:
the probability that the pair (X, Y) falls in the rectangle $[a_1, b_1] \times [a_2, b_2]$
is equal to the volume under the joint probability density function f(x, y) of
(X, Y) over the rectangle. This is illustrated in Figure 9.1, where a chunk of
a joint probability density function f(x, y) is displayed for x between -0.5
and 1 and for y between -1.5 and 1. Its volume represents the probability
$\mathrm{P}(-0.5 \le X \le 1, -1.5 \le Y \le 1)$. As the volume under f on $[-0.5, 1] \times [-1.5, 1]$
is equal to the integral of f over this rectangle, this motivates the following
definition.




Fig. 9.1. Volume under a joint probability density function f on the rectangle 
[−0.5, 1] × [−1.5, 1]. 


DEFINITION. Random variables X and Y have a joint continuous 
distribution if for some function $f : \mathbb{R}^2 \to \mathbb{R}$ and for all numbers 
$a_1, a_2$ and $b_1, b_2$ with $a_1 \le b_1$ and $a_2 \le b_2$, 

$$\mathrm{P}(a_1 \le X \le b_1, a_2 \le Y \le b_2) = \int_{a_1}^{b_1}\int_{a_2}^{b_2} f(x,y)\,dx\,dy.$$

The function f has to satisfy $f(x,y) \ge 0$ for all x and y, and 
$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,dx\,dy = 1$. We call f the joint probability density 
function of X and Y. 


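As in the one-dimensional case there is a simple relation between the joint 
distribution function F and the joint probability density function f: 

$$F(a,b) = \int_{-\infty}^{a}\int_{-\infty}^{b} f(x,y)\,dx\,dy \quad\text{and}\quad f(x,y) = \frac{\partial^2}{\partial x\,\partial y} F(x,y).$$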


A joint probability density function of two random variables is also called 
a bivariate probability density. An explicit example of such a density is the 
function 

$$f(x,y) = \frac{30}{\pi}\, e^{-50x^2 - 50y^2 + 80xy}$$

for −∞ < x < ∞ and −∞ < y < ∞; see Figure 9.2. This is an example of 
a bivariate normal density (see Remark 11.2 for a full description of bivariate 
normal distributions). 

Fig. 9.2. A bivariate normal probability density function. 

We illustrate a number of properties of joint continuous distributions by means 
of the following simple example. Suppose that X and Y have joint probability 
density function 

$$f(x,y) = \frac{2}{75}\left(2x^2y + xy^2\right) \quad \text{for } 0 \le x \le 3 \text{ and } 1 \le y \le 2,$$

and f(x,y) = 0 otherwise; see Figure 9.3. 

Fig. 9.3. The probability density function $f(x,y) = \frac{2}{75}(2x^2y + xy^2)$. 




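As an illustration of how to compute joint probabilities: 

$$\begin{aligned}
\mathrm{P}\left(1 \le X \le 2, \tfrac{4}{3} \le Y \le \tfrac{5}{3}\right)
&= \int_{1}^{2}\int_{4/3}^{5/3} f(x,y)\,dx\,dy \\
&= \frac{2}{75}\int_{1}^{2}\left(\int_{4/3}^{5/3} \left(2x^2y + xy^2\right) dy\right) dx \\
&= \frac{2}{75}\int_{1}^{2}\left(x^2 + \frac{61}{81}x\right) dx = \frac{187}{2025}.
\end{aligned}$$

Next, for a between 0 and 3 and b between 1 and 2, we determine the 
expression of the joint distribution function. Since f(x,y) = 0 for x < 0 or 
y < 1, 

$$\begin{aligned}
F(a,b) = \mathrm{P}(X \le a, Y \le b)
&= \int_{-\infty}^{a}\left(\int_{-\infty}^{b} f(x,y)\,dy\right) dx \\
&= \frac{2}{75}\int_{0}^{a}\left(\int_{1}^{b} \left(2x^2y + xy^2\right) dy\right) dx \\
&= \frac{1}{225}\left(2a^3b^2 - 2a^3 + a^2b^3 - a^2\right).
\end{aligned}$$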


Note that for either a outside [0, 3] or b outside [1, 2], the expression for F(a, b) 
is different. For example, suppose that a is between 0 and 3 and b is larger 
than 2. Since f(x, y) = 0 for y > 2, we find for any b > 2: 

$$F(a,b) = \mathrm{P}(X \le a, Y \le b) = \mathrm{P}(X \le a, Y \le 2) = F(a,2) = \frac{1}{225}\left(6a^3 + 7a^2\right).$$

Hence, applying (9.1) one finds the marginal distribution function of X: 

$$F_X(a) = \lim_{b\to\infty} F(a,b) = \frac{1}{225}\left(6a^3 + 7a^2\right)$$

for a between 0 and 3. 
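The closed form for F can be checked against the probability 187/2025 computed earlier. The sketch below is only an illustration: the helper names `F` and `rect_prob` are not from the text, and the clamping of the arguments encodes the fact that f vanishes outside the rectangle [0, 3] × [1, 2].

```python
from fractions import Fraction

def F(a, b):
    """Joint distribution function of the example, for arbitrary real a and b."""
    a = Fraction(min(max(a, 0), 3))  # no probability mass outside 0 <= x <= 3
    b = Fraction(min(max(b, 1), 2))  # no probability mass outside 1 <= y <= 2
    return (2 * a**3 * b**2 - 2 * a**3 + a**2 * b**3 - a**2) / 225

def rect_prob(a1, b1, a2, b2):
    """P(a1 <= X <= b1, a2 <= Y <= b2), by inclusion-exclusion on F."""
    return F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)

p = rect_prob(1, 2, Fraction(4, 3), Fraction(5, 3))
print(p)
```

Because X and Y are continuous, the boundary of the rectangle carries no probability, so the inclusion-exclusion formula gives the same value as the double integral.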


QUICK EXERCISE 9.4 Show that $F_Y(b) = \frac{1}{75}\left(3b^3 + 18b^2 - 21\right)$ for b between 1 
and 2. 


The probability density of X can be found by differentiating $F_X$: 

$$f_X(x) = \frac{d}{dx} F_X(x) = \frac{d}{dx}\left(\frac{1}{225}\left(6x^3 + 7x^2\right)\right) = \frac{2}{225}\left(9x^2 + 7x\right)$$

for x between 0 and 3. It is also possible to obtain the probability density 
function of X directly from f(x, y). Recall that we determined marginal 
probabilities of discrete random variables by summing over the joint probabilities 
(see Table 9.2). In a similar way we can find $f_X$. For x between 0 and 3, 




$$f_X(x) = \int_{-\infty}^{\infty} f(x,y)\,dy = \frac{2}{75}\int_{1}^{2}\left(2x^2y + xy^2\right) dy = \frac{2}{225}\left(9x^2 + 7x\right).$$


This illustrates the following rule. 


FROM JOINT TO MARGINAL PROBABILITY DENSITY FUNCTION. Let 
f be the joint probability density function of random variables X 
and Y. Then the marginal probability densities of X and Y can be 
found as follows: 

$$f_X(x) = \int_{-\infty}^{\infty} f(x,y)\,dy \quad\text{and}\quad f_Y(y) = \int_{-\infty}^{\infty} f(x,y)\,dx.$$


Hence the probability density function of each of the random variables X and 
Y can easily be obtained by “integrating out” the other variable. 


QUICK EXERCISE 9.5 Determine fy(y). 


9.3 More than two random variables 


To determine the joint distribution of n random variables $X_1, X_2, \ldots, X_n$, all 
defined on the same sample space Ω, we have to describe how the probability 
mass is distributed over all possible values of $(X_1, X_2, \ldots, X_n)$. In fact, it 
suffices to specify the joint distribution function F of $X_1, X_2, \ldots, X_n$, which 
is defined by 

$$F(a_1, a_2, \ldots, a_n) = \mathrm{P}(X_1 \le a_1, X_2 \le a_2, \ldots, X_n \le a_n)$$

for $-\infty < a_1, a_2, \ldots, a_n < \infty$. 


In case the random variables $X_1, X_2, \ldots, X_n$ are discrete, the joint distribution 
can also be characterized by specifying the joint probability mass function p 
of $X_1, X_2, \ldots, X_n$, defined by 

$$p(a_1, a_2, \ldots, a_n) = \mathrm{P}(X_1 = a_1, X_2 = a_2, \ldots, X_n = a_n)$$

for $-\infty < a_1, a_2, \ldots, a_n < \infty$. 


Drawing without replacement 


Let us illustrate the use of the joint probability mass function with an example. 
In the weekly Dutch National Lottery Show, 6 balls are drawn from a vase 
that contains balls numbered from 1 to 41. Clearly, the first number takes 
values 1,2,...,41 with equal probabilities. Is this also the case for—say—the 
third ball? 




Let us consider a more general situation. Suppose a vase contains balls 
numbered 1, 2, ..., N. We draw n balls without replacement from the vase. Note 
that n cannot be larger than N. Each ball is selected with equal probability, 
i.e., in the first draw each ball has probability 1/N, in the second draw each of 
the N − 1 remaining balls has probability 1/(N − 1), and so on. Let $X_i$ denote 
the number on the ball in the i-th draw, for i = 1, 2, ..., n. In order to obtain 
the marginal probability mass function of $X_i$, we first compute the joint 
probability mass function of $X_1, X_2, \ldots, X_n$. Since there are $N(N-1)\cdots(N-n+1)$ 
possible combinations for the values of $X_1, X_2, \ldots, X_n$, each having the same 
probability, the joint probability mass function is given by 

$$p(a_1, a_2, \ldots, a_n) = \mathrm{P}(X_1 = a_1, X_2 = a_2, \ldots, X_n = a_n) = \frac{1}{N(N-1)\cdots(N-n+1)},$$


for all distinct values $a_1, a_2, \ldots, a_n$ with $1 \le a_j \le N$. Clearly $X_1, X_2, \ldots, X_n$ 
influence each other. Nevertheless, the marginal distribution of each $X_i$ is 
the same. This can be seen as follows. Similar to obtaining the marginal 
probability mass functions in Table 9.2, we can find the marginal probability 
mass function of $X_i$ by summing the joint probability mass function over all 
possible values of $X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n$: 

$$p_{X_i}(k) = \sum p(a_1, \ldots, a_{i-1}, k, a_{i+1}, \ldots, a_n) = \sum \frac{1}{N(N-1)\cdots(N-n+1)},$$

where the sum runs over all distinct values $a_1, a_2, \ldots, a_n$ with $1 \le a_j \le N$ 
and $a_i = k$. Since there are $(N-1)(N-2)\cdots(N-n+1)$ such combinations, 
we conclude that the marginal probability mass function of $X_i$ is given by 

$$p_{X_i}(k) = (N-1)(N-2)\cdots(N-n+1)\cdot\frac{1}{N(N-1)\cdots(N-n+1)} = \frac{1}{N},$$

for k = 1, 2, ..., N. We see that the marginal probability mass function of 
each $X_i$ is the same, assigning equal probability 1/N to each possible value. 
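For small N and n the counting argument can be confirmed by brute force. The following sketch enumerates all ordered draws; the function name `marginal_pmf` and the values N = 7, n = 4, i = 3 are arbitrary choices made for illustration.

```python
from fractions import Fraction
from itertools import permutations

def marginal_pmf(N, n, i):
    """Marginal pmf of X_i (i is the 1-based draw index) when n balls are
    drawn without replacement from balls numbered 1..N."""
    outcomes = list(permutations(range(1, N + 1), n))  # all ordered draws
    weight = Fraction(1, len(outcomes))                # each is equally likely
    pmf = {k: Fraction(0) for k in range(1, N + 1)}
    for draw in outcomes:
        pmf[draw[i - 1]] += weight                     # sum the joint pmf over the other draws
    return pmf

pmf_third = marginal_pmf(N=7, n=4, i=3)
print(pmf_third)
```

Every value in the resulting dictionary equals 1/N, in line with the formula for the marginal probability mass function.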

In case the random variables $X_1, X_2, \ldots, X_n$ are continuous, the joint 
distribution is defined in a similar way as in the case of two variables. We say 
that the random variables $X_1, X_2, \ldots, X_n$ have a joint continuous distribution 
if for some function $f : \mathbb{R}^n \to \mathbb{R}$ and for all numbers $a_1, a_2, \ldots, a_n$ and 
$b_1, b_2, \ldots, b_n$ with $a_i \le b_i$, 

$$\mathrm{P}(a_1 \le X_1 \le b_1, a_2 \le X_2 \le b_2, \ldots, a_n \le X_n \le b_n) = \int_{a_1}^{b_1}\int_{a_2}^{b_2}\cdots\int_{a_n}^{b_n} f(x_1, x_2, \ldots, x_n)\,dx_1\,dx_2\cdots dx_n.$$

Again f has to satisfy $f(x_1, x_2, \ldots, x_n) \ge 0$ and f has to integrate to 1. We 
call f the joint probability density of $X_1, X_2, \ldots, X_n$. 




9.4 Independent random variables 


In earlier chapters we have spoken of independence of random variables, 
anticipating a formal definition. On page 46 we postulated that the events 

$$\{R_1 = a_1\}, \{R_2 = a_2\}, \ldots, \{R_{10} = a_{10}\}$$

related to the Bernoulli random variables $R_1, \ldots, R_{10}$ are independent. How 
should one define independence of random variables? Intuitively, random vari- 
ables X and Y are independent if every event involving only X is indepen- 
dent of every event involving only Y. Since for two discrete random variables 
X and Y, any event involving X and Y is the union of events of the type 
{X =a,Y = b}, an adequate definition for independence would be 


P(X =a,Y =b) =P(X =a)P(Y =b), (9.3) 


for all possible values a and b. However, this definition is useless for continuous 
random variables. Both the discrete and the continuous case are covered by 
the following definition. 


DEFINITION. The random variables X and Y, with joint distribution 
function F, are independent if 

$$\mathrm{P}(X \le a, Y \le b) = \mathrm{P}(X \le a)\,\mathrm{P}(Y \le b),$$

that is, 

$$F(a,b) = F_X(a) F_Y(b) \tag{9.4}$$

for all possible values a and b. Random variables that are not 
independent are called dependent. 


Note that independence of X and Y guarantees that the joint probability of 
{X ≤ a, Y ≤ b} factorizes. More generally, the following is true: if X and Y 
are independent, then 

$$\mathrm{P}(X \in A, Y \in B) = \mathrm{P}(X \in A)\,\mathrm{P}(Y \in B), \tag{9.5}$$

for all suitable A and B, such as intervals and points. As a special case we 
can take A = {a}, B = {b}, which yields that for independent X and Y the 
probability of {X = a, Y = b} equals the product of the marginal probabilities. 
In fact, for discrete random variables the definition of independence can be 
reduced, after cumbersome computations, to equality (9.3). For continuous 
random variables X and Y we find, differentiating both sides of (9.4) with 
respect to x and y, that 

$$f(x,y) = f_X(x) f_Y(y).$$




QUICK EXERCISE 9.6 Determine for which value of ε the discrete random 
variables X and Y from Quick exercise 9.2 are independent. 


More generally, random variables $X_1, X_2, \ldots, X_n$, with joint distribution 
function F, are independent if for all values $a_1, \ldots, a_n$, 

$$F(a_1, a_2, \ldots, a_n) = F_{X_1}(a_1) F_{X_2}(a_2) \cdots F_{X_n}(a_n).$$


As in the case of two discrete random variables, the discrete random variables 
$X_1, X_2, \ldots, X_n$ are independent if 

$$\mathrm{P}(X_1 = a_1, \ldots, X_n = a_n) = \mathrm{P}(X_1 = a_1) \cdots \mathrm{P}(X_n = a_n),$$

for all possible values $a_1, \ldots, a_n$. Thus we see that the definition of 
independence for discrete random variables is in agreement with our intuitive 
interpretation given earlier in (9.3). 
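For a finite table the factorization can be checked mechanically. The sketch below uses the joint probability mass function from Quick exercise 9.2, keyed by (a, b); the helper names `joint_table`, `marginals` and `is_independent` are illustrative and not part of the text.

```python
from fractions import Fraction

def joint_table(eps):
    """Joint pmf of Quick exercise 9.2, keyed by (a, b); eps lies in [-1/4, 1/4]."""
    q = Fraction(1, 4)
    return {(0, 0): q - eps, (1, 0): q + eps, (0, 1): q + eps, (1, 1): q - eps}

def marginals(joint):
    """Marginal pmfs of X and Y, obtained by summing over the other variable."""
    p_x, p_y = {}, {}
    for (a, b), p in joint.items():
        p_x[a] = p_x.get(a, 0) + p
        p_y[b] = p_y.get(b, 0) + p
    return p_x, p_y

def is_independent(joint):
    """True if p(a, b) = p_X(a) p_Y(b) for every pair of possible values."""
    p_x, p_y = marginals(joint)
    return all(joint.get((a, b), 0) == p_x[a] * p_y[b] for a in p_x for b in p_y)

print(is_independent(joint_table(Fraction(0))))
print(is_independent(joint_table(Fraction(1, 10))))
```

Exact fractions are used so that the equality test is not affected by floating-point rounding.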

In case of independent continuous random variables $X_1, X_2, \ldots, X_n$ with joint 
probability density function f, differentiating the joint distribution function 
with respect to all the variables gives that 

$$f(x_1, x_2, \ldots, x_n) = f_{X_1}(x_1) f_{X_2}(x_2) \cdots f_{X_n}(x_n) \tag{9.6}$$

for all values $x_1, \ldots, x_n$. By integrating both sides over 
$(-\infty, a_1] \times (-\infty, a_2] \times \cdots \times (-\infty, a_n]$, we find the definition of independence. Hence in the continuous 
case, (9.6) is equivalent to the definition of independence. 


9.5 Propagation of independence 


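A natural question is whether transformed independent random variables are 
again independent. We start with a simple example. Let X and Y be two 
independent random variables with joint distribution function F. Take an 
interval I = (a,b] and define random variables U and V as follows: 

$$U = \begin{cases} 1 & \text{if } X \in I \\ 0 & \text{if } X \notin I \end{cases} \quad\text{and}\quad V = \begin{cases} 1 & \text{if } Y \in I \\ 0 & \text{if } Y \notin I. \end{cases}$$

Are U and V independent? Yes, they are! By using (9.5) and the independence 
of X and Y, we can write 

$$\begin{aligned}
\mathrm{P}(U = 0, V = 1) &= \mathrm{P}(X \in I^c, Y \in I) \\
&= \mathrm{P}(X \in I^c)\,\mathrm{P}(Y \in I) \\
&= \mathrm{P}(U = 0)\,\mathrm{P}(V = 1).
\end{aligned}$$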


By a similar reasoning one finds that for all values a and b, 




$$\mathrm{P}(U = a, V = b) = \mathrm{P}(U = a)\,\mathrm{P}(V = b).$$


This illustrates the fact that for independent random variables $X_1, X_2, \ldots, X_n$, 
the random variables $Y_1, Y_2, \ldots, Y_n$, where each $Y_i$ is determined by $X_i$ only, 
inherit the independence from the $X_i$. The general rule is given here. 


PROPAGATION OF INDEPENDENCE. Let $X_1, X_2, \ldots, X_n$ be independent 
random variables. For each i, let $h_i : \mathbb{R} \to \mathbb{R}$ be a function and 
define the random variable 

$$Y_i = h_i(X_i).$$

Then $Y_1, Y_2, \ldots, Y_n$ are also independent. 


Often one uses this rule with all functions the same: $h_i = h$. For instance, in 
the preceding example, 

$$h(x) = \begin{cases} 1 & \text{if } x \in I \\ 0 & \text{if } x \notin I. \end{cases}$$


The rule is also useful when we need different transformations for different 
$X_i$. We already saw an example of this in Chapter 6. In the single-server 
queue example in Section 6.4, the Exp(0.5) random variables $T_1, T_2, \ldots$ and 
U(2,5) random variables $S_1, S_2, \ldots$ are required to be independent. They are 
generated according to the technique described in Section 6.2. With a 
sequence $U_1, U_2, \ldots$ of independent U(0,1) random variables we can accomplish 
independence of the $T_i$ and $S_i$ as follows: 

$$T_i = F^{\mathrm{inv}}(U_{2i-1}) \quad\text{and}\quad S_i = G^{\mathrm{inv}}(U_{2i}),$$

where F and G are the distribution functions of the Exp(0.5) distribution and 
the U(2,5) distribution. The propagation-of-independence rule now guarantees 
that all random variables $T_1, S_1, T_2, S_2, \ldots$ are independent. 
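The construction can be sketched in a few lines of code. This is a minimal illustration, assuming that Exp(0.5) has distribution function F(x) = 1 − e^{−0.5x} for x ≥ 0 (so its mean is 2) and that U(2,5) has distribution function G(x) = (x − 2)/3 on [2, 5]; the function names and the seed are hypothetical choices.

```python
import math
import random

def exp_inverse(u, lam=0.5):
    """Inverse of F(x) = 1 - exp(-lam * x), the assumed Exp(lam) distribution function."""
    return -math.log(1.0 - u) / lam

def uniform_inverse(u, alpha=2.0, beta=5.0):
    """Inverse of G(x) = (x - alpha) / (beta - alpha), the U(alpha, beta) distribution function."""
    return alpha + (beta - alpha) * u

def simulate(n, seed=0):
    """Return (T_1..T_n, S_1..S_n) built from one sequence U_1, ..., U_{2n}."""
    rng = random.Random(seed)
    u = [rng.random() for _ in range(2 * n)]                # u[0] plays the role of U_1
    t = [exp_inverse(u[2 * i]) for i in range(n)]           # T_i from the odd-indexed U's
    s = [uniform_inverse(u[2 * i + 1]) for i in range(n)]   # S_i from the even-indexed U's
    return t, s

t, s = simulate(20000)
print(sum(t) / len(t), sum(s) / len(s))
```

Each T_i and each S_i is a function of a different U_j, which is exactly the situation covered by the propagation-of-independence rule.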


9.6 Solutions to the quick exercises 


9.1 The only possibilities with the sum equal to 7 and the maximum equal 
to 4 are the combinations (3,4) and (4,3). They both have probability 1/36, 
so that P(S = 7, M = 4) = 2/36. 


9.2 Since $p_X(0)$, $p_X(1)$, $p_Y(0)$, and $p_Y(1)$ are all equal to 1/2, knowing only 
$p_X$ and $p_Y$ yields no information on ε whatsoever. You have to be a student 
at Hogwarts to be able to get the values of p right! 


9.3 Since S and M are discrete random variables, F(5,3) is the sum of the 
probabilities P(S = a, M = b) of all combinations (a, b) with a ≤ 5 and b ≤ 3. 
From Table 9.2 we see that this sum is 8/36. 




9.4 For a between 0 and 3 and for b between 1 and 2, we have seen that 

$$F(a,b) = \frac{1}{225}\left(2a^3b^2 - 2a^3 + a^2b^3 - a^2\right).$$

Since f(x,y) = 0 for x > 3, we find for any a > 3 and b between 1 and 2: 

$$F(a,b) = \mathrm{P}(X \le a, Y \le b) = \mathrm{P}(X \le 3, Y \le b) = F(3,b) = \frac{1}{75}\left(3b^3 + 18b^2 - 21\right).$$

As a result, applying (9.2) yields that $F_Y(b) = \lim_{a\to\infty} F(a,b) = F(3,b) = 
\frac{1}{75}\left(3b^3 + 18b^2 - 21\right)$, for b between 1 and 2. 


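9.5 For y between 1 and 2, we have seen that $F_Y(y) = \frac{1}{75}\left(3y^3 + 18y^2 - 21\right)$. 
Differentiating with respect to y yields that 

$$f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{1}{25}\left(3y^2 + 12y\right),$$

for y between 1 and 2 (and $f_Y(y) = 0$ otherwise). The probability density 
function of Y can also be obtained directly from f(x,y). For y between 1 
and 2: 

$$\begin{aligned}
f_Y(y) &= \int_{-\infty}^{\infty} f(x,y)\,dx = \frac{2}{75}\int_{0}^{3}\left(2x^2y + xy^2\right) dx \\
&= \frac{2}{75}\left[\frac{2}{3}x^3y + \frac{1}{2}x^2y^2\right]_{x=0}^{x=3} = \frac{1}{25}\left(3y^2 + 12y\right).
\end{aligned}$$

Since f(x,y) = 0 for values of y not between 1 and 2, we have that 
$f_Y(y) = \int_{-\infty}^{\infty} f(x,y)\,dx = 0$ for these y's. 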


9.6 The number ε is between −1/4 and 1/4. Now X and Y are independent 
in case $p(i,j) = \mathrm{P}(X = i, Y = j) = \mathrm{P}(X = i)\,\mathrm{P}(Y = j) = p_X(i)\,p_Y(j)$, for all 
i, j = 0, 1. If i = j = 0, we should have 

$$\frac{1}{4} - \varepsilon = p(0,0) = p_X(0)\,p_Y(0) = \frac{1}{4}.$$

This implies that ε = 0. Furthermore, for all other combinations (i, j) one 
can check that for ε = 0 also $p(i,j) = p_X(i)\,p_Y(j)$, so that X and Y are 
independent. If ε ≠ 0, we have $p(0,0) \ne p_X(0)\,p_Y(0)$, so that X and Y are 
dependent. 


9.7 Exercises 


9.1 The joint probabilities P(X = a, Y = b) of discrete random variables X 
and Y are given in the following table (which is based on the magical square 
in Albrecht Dürer's engraving Melencolia I in Figure 9.4). Determine the 
marginal probability distributions of X and Y, i.e., determine the probabilities 
P(X = a) and P(Y = b) for a, b = 1, 2, 3, 4. 




Fig. 9.4. Albrecht Dürer's Melencolia I. 


Albrecht Dürer (German, 1471-1528) Melencolia I, 1514. Engraving. Bequest 
of William P. Chapman, Jr., Class of 1895. Courtesy of the Herbert F. Johnson 
Museum of Art, Cornell University. 


                     a
b        1         2         3         4
1     16/136     3/136     2/136    13/136
2      5/136    10/136    11/136     8/136
3      9/136     6/136     7/136    12/136
4      4/136    15/136    14/136     1/136




9.2 The joint probability distribution of two discrete random variables X 
and Y is partly given in the following table. 

                    a
b            0       1       2      P(Y = b)
−1          ...     ...     ...       1/2
 1          ...     1/2     ...       1/2
P(X = a)    1/6     2/3     1/6        1


a. Complete the table. 
b. Are X and Y dependent or independent? 


9.3 Let X and Y be two random variables, with joint distribution the 
Melencolia distribution, given by the table in Exercise 9.1. What is 

a. P(X = Y)? 
b. P(X + Y = 5)? 
c. P(1 < X < 3, 1 < Y < 3)? 
d. P((X,Y) ∈ {1,4} × {1,4})? 


9.4 This exercise will be easy for those familiar with Japanese puzzles called 
nonograms. The marginal probability distributions of the discrete random 
variables X and Y are given in the following table: 

                          a
b            1       2       3       4       5      P(Y = b)
1                                                     5/14
2                                                     4/14
3                                                     2/14
4                                                     2/14
5                                                     1/14
P(X = a)   1/14    5/14    4/14    2/14    2/14        1

Moreover, for a and b from 1 to 5 the joint probability P(X = a, Y = b) is 
either 0 or 1/14. Determine the joint probability distribution of X and Y. 


9.5 Let η be an unknown real number, and let the joint probabilities 
P(X = a, Y = b) of the discrete random variables X and Y be given by the 
following table: 




=) 


ao RLS 
ar © 


i 
8 
1 1 
a3 ie a7" 


a. Which values can η attain? 
b. Is there a value of η for which X and Y are independent? 


9.6 Let X and Y be two independent Ber(1/2) random variables. Define 
random variables U and V by: 

$$U = X + Y \quad\text{and}\quad V = |X - Y|.$$


a. Determine the joint and marginal probability distributions of U and V. 
b. Find out whether U and V are dependent or independent. 


9.7 To investigate the relation between hair color and eye color, the hair color 
and eye color of 5383 persons were recorded. The data are given in the following 
table: 


Hair color 
Eye color Fair/red Medium Dark/black 
Light 1168 825 305 
Dark 573 1312 1200 


Source: B. Everitt and G. Dunn. Applied multivariate data analysis. Second 
edition Hodder Arnold, 2001; Table 4.12. Reproduced by permission of Hodder 
& Stoughton. 


Eye color is encoded by the values 1 (Light) and 2 (Dark), and hair color by 
1 (Fair/red), 2 (Medium), and 3 (Dark/black). By dividing the numbers in 
the table by 5383, the table is turned into a joint probability distribution for 
random variables X (hair color) taking values 1 to 3 and Y (eye color) taking 
values 1 and 2. 


a. Determine the joint and marginal probability distributions of X and Y. 
b. Find out whether X and Y are dependent or independent. 


9.8 Let X and Y be independent random variables with probability 
distributions given by 

$$\mathrm{P}(X = 0) = \mathrm{P}(X = 1) = \frac{1}{2} \quad\text{and}\quad \mathrm{P}(Y = 0) = \mathrm{P}(Y = 2) = \frac{1}{2}.$$

a. Compute the distribution of Z = X + Y. 

b. Let $\tilde{Y}$ and $\tilde{Z}$ be independent random variables, where $\tilde{Y}$ has the same 
distribution as Y, and $\tilde{Z}$ the same distribution as Z. Compute the 
distribution of $\tilde{X} = \tilde{Z} - \tilde{Y}$. 


9.9 Suppose that the joint distribution function of X and Y is given by 

$$F(x,y) = 1 - e^{-2x} - e^{-y} + e^{-(2x+y)} \quad \text{if } x > 0,\ y > 0,$$

and F(x, y) = 0 otherwise. 


a. Determine the marginal distribution functions of X and Y. 

b. Determine the joint probability density function of X and Y. 

c. Determine the marginal probability density functions of X and Y. 
d. Find out whether X and Y are independent. 


9.10 Let X and Y be two continuous random variables with joint 
probability density function 

$$f(x,y) = \frac{12}{5}\,xy(1 + y) \quad \text{for } 0 \le x \le 1 \text{ and } 0 \le y \le 1,$$

and f(x,y) = 0 otherwise. 

a. Find the probability $\mathrm{P}\left(\frac{1}{4} \le X \le \frac{1}{2}, \frac{1}{3} \le Y \le \frac{2}{3}\right)$. 

b. Determine the joint distribution function of X and Y for a and b between 
0 and 1. 

c. Use your answer from b to find $F_X(a)$ for a between 0 and 1. 


d. Apply the rule on page 122 to find the probability density function of X 
from the joint probability density function f(x,y). Use the result to verify 
your answer from c. 


e. Find out whether X and Y are independent. 


9.11 Let X and Y be two continuous random variables, with the same 
joint probability density function as in Exercise 9.10. Find the probability 
P(X < Y) that X is smaller than Y. 


9.12 The joint probability density function f of the pair (X,Y) is given by 

$$f(x,y) = K\left(3x^2 + 8xy\right) \quad \text{for } 0 \le x \le 1 \text{ and } 0 \le y \le 2,$$

and f(x,y) = 0 for all other values of x and y. Here K is some positive 
constant. 

a. Find K. 
b. Determine the probability P(2X < Y). 




9.13 On a disc with origin (0,0) and radius 1, a point (X,Y) is selected by 
throwing a dart that hits the disc in an arbitrary place. This is best described 
by the joint probability density function f of X and Y, given by 

$$f(x,y) = \begin{cases} c & \text{if } x^2 + y^2 \le 1 \\ 0 & \text{otherwise,} \end{cases}$$

where c is some positive constant. 

a. Determine c. 

b. Let $R = \sqrt{X^2 + Y^2}$ be the distance from (X,Y) to the origin. Determine 
the distribution function $F_R$. 

c. Determine the marginal density function $f_X$. Without doing any 
calculations, what can you say about $f_Y$? 


9.14 An arbitrary point (X,Y) is drawn from the square [−1, 1] × [−1, 1]. 
This means that for any region G in the plane, the probability that (X,Y) is 
in G is given by the area of $G \cap \Box$ divided by the area of $\Box$, where $\Box$ denotes 
the square [−1, 1] × [−1, 1]: 

$$\mathrm{P}((X,Y) \in G) = \frac{\text{area of } G \cap \Box}{\text{area of } \Box}.$$


a. Determine the joint probability density function of the pair (X,Y). 


b. Check that X and Y are two independent, U(—1,1) distributed random 
variables. 


9.15 Let the pair (X,Y) be drawn arbitrarily from the triangle Δ with 
vertices (0,0), (0,1), and (1,1). 

a. Use Figure 9.5 to show that the joint distribution function F of the pair 
(X,Y) satisfies 

$$F(a,b) = \begin{cases}
0 & \text{for } a \text{ or } b \text{ less than } 0 \\
a(2b - a) & \text{for } (a,b) \text{ in the triangle } \Delta \\
b^2 & \text{for } b \text{ between 0 and 1 and } a \text{ larger than } b \\
2a - a^2 & \text{for } a \text{ between 0 and 1 and } b \text{ larger than 1} \\
1 & \text{for } a \text{ and } b \text{ larger than 1.}
\end{cases}$$

b. Determine the joint probability density function f of the pair (X,Y). 
c. Show that $f_X(x) = 2 - 2x$ for x between 0 and 1 and that $f_Y(y) = 2y$ for 
y between 0 and 1. 


9.16 (Continuation of Exercise 9.15) An arbitrary point (U,V) is drawn from 
the unit square [0, 1] x [0,1]. Let X and Y be defined as in Exercise 9.15. Show 
that min{U,V} has the same distribution as X and that max{U,V} has the 
same distribution as Y. 


Fig. 9.5. Drawing (X,Y) from (−∞, a] × (−∞, b] ∩ Δ. 


9.17 Let $U_1$ and $U_2$ be two independent random variables, both uniformly 
distributed over [0, a]. Let $V = \min\{U_1, U_2\}$ and $Z = \max\{U_1, U_2\}$. Show 
that the joint distribution function of V and Z is given by 

$$F(s,t) = \mathrm{P}(V \le s, Z \le t) = \frac{t^2 - (t - s)^2}{a^2} \quad \text{for } 0 \le s \le t \le a.$$

Hint: note that V ≤ s and Z ≤ t happens exactly when both $U_1 \le t$ and 
$U_2 \le t$, but not both $s < U_1 \le t$ and $s < U_2 \le t$. 


9.18 Suppose a vase contains balls numbered 1, 2, ..., N. We draw n balls 
without replacement from the vase. Each ball is selected with equal probability, 
i.e., in the first draw each ball has probability 1/N, in the second draw each 
of the N − 1 remaining balls has probability 1/(N − 1), and so on. For 
i = 1, 2, ..., n, let $X_i$ denote the number on the ball in the ith draw. We have 
shown that the marginal probability mass function of $X_i$ is given by 

$$p_{X_i}(k) = \frac{1}{N}, \quad \text{for } k = 1, 2, \ldots, N.$$

a. Show that 

$$\mathrm{E}[X_i] = \frac{N + 1}{2}.$$

b. Compute the variance of $X_i$. You may use the identity 

$$1 + 4 + 9 + \cdots + N^2 = \frac{1}{6}N(N + 1)(2N + 1).$$


9.19 Let X and Y be two continuous random variables, with joint 
probability density function 

$$f(x,y) = \frac{30}{\pi}\, e^{-50x^2 - 50y^2 + 80xy}$$

for −∞ < x < ∞ and −∞ < y < ∞; see also Figure 9.2. 

a. Determine positive numbers a, b, and c such that 

$$50x^2 - 80xy + 50y^2 = (ay - bx)^2 + cx^2.$$

b. Setting $\mu = \frac{4}{5}x$ and $\sigma = \frac{1}{10}$, show that 

$$\left(\sqrt{50}\,y - \sqrt{32}\,x\right)^2 = \frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2$$

and use this to show that 

$$\int_{-\infty}^{\infty} e^{-\left(\sqrt{50}\,y - \sqrt{32}\,x\right)^2}\,dy = \frac{\sqrt{2\pi}}{10}.$$

c. Use the results from b to determine the probability density function $f_X$ 
of X. What kind of distribution does X have? 


9.20 Suppose we throw a needle on a large sheet of paper, on which horizontal 
lines are drawn, which are at needle-length apart (see also Exercise 21.16). 
Choose one of the horizontal lines as x-axis, and let (X,Y) be the center of the 
needle. Furthermore, let Z be the distance of this center (X,Y) to the nearest 
horizontal line under (X,Y), and let H be the angle between the needle and 
the positive x-axis. 


a. Assuming that the length of the needle is equal to 1, argue that Z has 
a U(0,1) distribution. Also argue that H has a U(0,π) distribution and 
that Z and H are independent. 

b. Show that the needle hits a horizontal line when 

$$Z \le \frac{1}{2}\sin H \quad\text{or}\quad 1 - Z \le \frac{1}{2}\sin H.$$

c. Show that the probability that the needle will hit one of the horizontal 
lines equals 2/π. 


10 


Covariance and correlation 


In this chapter we see how the joint distribution of two or more random vari- 
ables is used to compute the expectation of a combination of these random 
variables. We discuss the expectation and variance of a sum of random vari- 
ables and introduce the notions of covariance and correlation, which express 
to some extent the way two random variables influence each other. 


10.1 Expectation and joint distributions 


China vases of various shapes are produced in the Delftware factories in the 
old city of Delft. One particular simple cylindrical model has height H and 
radius R centimeters. Due to all kinds of circumstances—the place of the vase 
in the oven, the fact that the vases are handmade, etc.—H and R are not 
constants but are random variables. The volume of a vase is equal to the 
random variable $V = \pi H R^2$, and one is interested in its expected value E[V]. 
When $f_V$ denotes the probability density of V, then by definition 

$$\mathrm{E}[V] = \int_{-\infty}^{\infty} v f_V(v)\,dv.$$

However, to obtain E[V], we do not necessarily need to determine $f_V$ from 
the joint probability density f of H and R! Since V is a function of H and R, 
we can use a rule similar to the change-of-variable formula from Chapter 7: 

$$\mathrm{E}[V] = \mathrm{E}\left[\pi H R^2\right] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \pi h r^2 f(h,r)\,dh\,dr.$$


Suppose that H has a U(25,35) distribution and that R has a U(7.5,12.5) 
distribution. In the case that H and R are also independent, we have 




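$$\begin{aligned}
\mathrm{E}[V] &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \pi h r^2 f_H(h) f_R(r)\,dh\,dr = \int_{25}^{35}\int_{7.5}^{12.5} \pi h r^2 \cdot \frac{1}{10}\cdot\frac{1}{5}\,dh\,dr \\
&= \frac{\pi}{50}\int_{25}^{35} h\,dh \int_{7.5}^{12.5} r^2\,dr = 9621.127\ \text{cm}^3.
\end{aligned}$$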
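The value can be reproduced numerically. In the sketch below, which is only an illustration, the double integral is split into the product of E[H] and E[R²] as in the last step of the computation; the helper `uniform_moment` is a hypothetical name for the k-th moment of a uniform distribution.

```python
import math
from fractions import Fraction

def uniform_moment(alpha, beta, k):
    """E[U^k] for U ~ U(alpha, beta): the integral of u^k / (beta - alpha) over [alpha, beta]."""
    alpha, beta = Fraction(alpha), Fraction(beta)
    return (beta ** (k + 1) - alpha ** (k + 1)) / ((k + 1) * (beta - alpha))

e_h = uniform_moment(25, 35, 1)                             # E[H]
e_r2 = uniform_moment(Fraction(15, 2), Fraction(25, 2), 2)  # E[R^2]

# For independent H and R the expectation of pi * H * R^2 factorizes.
expected_volume = math.pi * float(e_h * e_r2)
print(expected_volume)
```

The exact value is 3062.5π, which is roughly 9621.13 cubic centimeters.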


This illustrates the following general rule. 
TWO-DIMENSIONAL CHANGE-OF-VARIABLE FORMULA. Let X and 
Y be random variables, and let $g : \mathbb{R}^2 \to \mathbb{R}$ be a function. 

If X and Y are discrete random variables with values $a_1, a_2, \ldots$ and 
$b_1, b_2, \ldots$, respectively, then 

$$\mathrm{E}[g(X,Y)] = \sum_i \sum_j g(a_i, b_j)\,\mathrm{P}(X = a_i, Y = b_j).$$

If X and Y are continuous random variables with joint probability 
density function f, then 

$$\mathrm{E}[g(X,Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x,y) f(x,y)\,dx\,dy.$$


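As an example, take g(x, y) = xy for discrete random variables X and Y with 
the joint probability distribution given in Table 10.1. The expectation of XY 
is computed as follows: 

$$\begin{aligned}
\mathrm{E}[XY] ={}& (0\cdot 0)\cdot 0 + (1\cdot 0)\cdot\tfrac{1}{4} + (2\cdot 0)\cdot 0 \\
&+ (0\cdot 1)\cdot\tfrac{1}{4} + (1\cdot 1)\cdot 0 + (2\cdot 1)\cdot\tfrac{1}{4} \\
&+ (0\cdot 2)\cdot 0 + (1\cdot 2)\cdot\tfrac{1}{4} + (2\cdot 2)\cdot 0 = 1.
\end{aligned}$$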


A natural question is whether this value can also be obtained from E[X] E[Y]. 
We return to this question later in this chapter. First we address the expec- 
tation of the sum of two random variables. 
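The discrete change-of-variable formula translates directly into a short function. The following sketch stores the nonzero entries of Table 10.1 in a dictionary keyed by (a, b); the name `expectation` is an illustrative choice.

```python
from fractions import Fraction

q = Fraction(1, 4)
# Joint probabilities P(X = a, Y = b) from Table 10.1, keyed by (a, b);
# pairs that are not listed have probability 0.
joint = {(1, 0): q, (0, 1): q, (2, 1): q, (1, 2): q}

def expectation(g, joint):
    """E[g(X, Y)] as the sum of g(a, b) * P(X = a, Y = b) over all pairs."""
    return sum(g(a, b) * p for (a, b), p in joint.items())

e_xy = expectation(lambda a, b: a * b, joint)
e_sum = expectation(lambda a, b: a + b, joint)
print(e_xy, e_sum)
```

Passing g as a function means that the same routine handles the product, the sum, or any other combination of X and Y.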


Table 10.1. Joint probabilities P(X = a, Y = b). 

              a
b       0      1      2
0       0     1/4     0
1      1/4     0     1/4
2       0     1/4     0



QUICK EXERCISE 10.1 Compute E[X + Y] for the random variables with the 
joint distribution given in Table 10.1. 


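For discrete X and Y with values $a_1, a_2, \ldots$ and $b_1, b_2, \ldots$, respectively, we 
see that 

$$\begin{aligned}
\mathrm{E}[X + Y] &= \sum_i \sum_j (a_i + b_j)\,\mathrm{P}(X = a_i, Y = b_j) \\
&= \sum_i \sum_j a_i\,\mathrm{P}(X = a_i, Y = b_j) + \sum_i \sum_j b_j\,\mathrm{P}(X = a_i, Y = b_j) \\
&= \sum_i a_i \Bigl(\sum_j \mathrm{P}(X = a_i, Y = b_j)\Bigr) + \sum_j b_j \Bigl(\sum_i \mathrm{P}(X = a_i, Y = b_j)\Bigr) \\
&= \sum_i a_i\,\mathrm{P}(X = a_i) + \sum_j b_j\,\mathrm{P}(Y = b_j) \\
&= \mathrm{E}[X] + \mathrm{E}[Y].
\end{aligned}$$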


A similar line of reasoning applies in case X and Y are continuous random 
variables. The following general rule holds. 


LINEARITY OF EXPECTATIONS. For all numbers r, s, and t and 
random variables X and Y, one has 

$$\mathrm{E}[rX + sY + t] = r\,\mathrm{E}[X] + s\,\mathrm{E}[Y] + t.$$


QUICK EXERCISE 10.2 Determine the marginal distributions for the random 
variables X and Y with the joint distribution given in Table 10.1, and use 
them to compute E[X] and E[Y]. Check that E[X] + E[Y] is equal to E[X + Y], 
which was computed in Quick exercise 10.1. 


More generally, for random variables $X_1, \ldots, X_n$ and numbers $s_1, \ldots, s_n$ and t, 

$$\mathrm{E}[s_1X_1 + \cdots + s_nX_n + t] = s_1\mathrm{E}[X_1] + \cdots + s_n\mathrm{E}[X_n] + t.$$


This rule is a powerful instrument. For example, it provides an easy way to 
compute the expectation of a random variable X with a Bin(n, p) distribution. 
If we were to use the definition of expectation, we would have to compute 

$$\mathrm{E}[X] = \sum_{k=0}^{n} k\,\mathrm{P}(X = k) = \sum_{k=0}^{n} k\binom{n}{k} p^k (1 - p)^{n-k}.$$
k=0 k=0 


To determine this sum is not straightforward. However, there is a simple alter- 
native. Recall the multiple-choice example from Section 4.3. We represented 




the number of correct answers out of 10 multiple-choice questions as a sum of 
10 Bernoulli random variables. More generally, any random variable X with 
a Bin(n, p) distribution can be represented as 

$$X = R_1 + R_2 + \cdots + R_n,$$

where $R_1, R_2, \ldots, R_n$ are independent Ber(p) random variables, i.e., 

$$R_i = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1 - p. \end{cases}$$

Since $\mathrm{E}[R_i] = 0\cdot(1 - p) + 1\cdot p = p$, for every i = 1, 2, ..., n, the 
linearity-of-expectations rule yields 

$$\mathrm{E}[X] = \mathrm{E}[R_1] + \mathrm{E}[R_2] + \cdots + \mathrm{E}[R_n] = np.$$


Hence we conclude that the expectation of a Bin(n, p) distribution equals np. 
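As a sanity check, the sum from the definition can be evaluated exactly for particular n and p and compared with np. The values n = 10 and p = 1/4 below are arbitrary, and the function name is illustrative.

```python
from fractions import Fraction
from math import comb

def binomial_mean_by_definition(n, p):
    """Sum of k * P(X = k) over k = 0..n for X with a Bin(n, p) distribution."""
    return sum(k * comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1))

n, p = 10, Fraction(1, 4)
mean = binomial_mean_by_definition(n, p)
print(mean, n * p)
```

Both routes give the same number, but the linearity argument needs no manipulation of binomial coefficients.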


Remark 10.1 (More than two random variables). In both the discrete 
and continuous cases, the change-of-variable formula for n random variables 
is a straightforward generalization of the change-of-variable formula for two 
random variables. For instance, if $X_1, X_2, \ldots, X_n$ are continuous random 
variables, with joint probability density function f, and g is a function from 
$\mathbb{R}^n$ to $\mathbb{R}$, then 

$$\mathrm{E}[g(X_1, \ldots, X_n)] = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g(x_1, \ldots, x_n) f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n.$$


10.2 Covariance 


In the previous section we have seen that for two random variables X and Y 
always 

E[X+Y]=E[X]+E[Y]. 
Does such a simple relation also hold for the variance of the sum Var(X + Y) 
or for the expectation of the product E[XY]? We will investigate this in the 
current section. 

For the variables X and Y from the example in Section 9.2 with joint 
probability density 

$$f(x,y) = \frac{2}{75}\left(2x^2y + xy^2\right) \quad \text{for } 0 \le x \le 3 \text{ and } 1 \le y \le 2,$$

one can show that 

$$\mathrm{Var}(X + Y) = \frac{939}{2000} \quad\text{and}\quad \mathrm{Var}(X) + \mathrm{Var}(Y) = \frac{989}{2500} + \frac{791}{10000} = \frac{4747}{10000}$$




(see Exercise 10.10). This shows, in contrast to the linearity-of-expectations 
rule, that Var(X + Y) is generally not equal to Var(X) + Var(Y). To determine 
Var(X + Y), we exploit its definition: 

$$\mathrm{Var}(X + Y) = \mathrm{E}\left[(X + Y - \mathrm{E}[X + Y])^2\right].$$

Now X + Y − E[X + Y] = (X − E[X]) + (Y − E[Y]), so that 

$$(X + Y - \mathrm{E}[X + Y])^2 = (X - \mathrm{E}[X])^2 + (Y - \mathrm{E}[Y])^2 + 2(X - \mathrm{E}[X])(Y - \mathrm{E}[Y]).$$

Taking expectations on both sides, another application of the 
linearity-of-expectations rule gives 

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{E}[(X - \mathrm{E}[X])(Y - \mathrm{E}[Y])].$$


That is, the variance of the sum X + Y equals the sum of the variances of X 
and Y, plus an extra term 2E[(X — E[X])(Y — E[Y])]. To some extent this 
term expresses the way X and Y influence each other. 


DEFINITION. Let X and Y be two random variables. The covariance 
between X and Y is defined by 


Cov(X,Y) = E[(X — E[X])(Y - E[Y])]. 


Loosely speaking, if the covariance of X and Y is positive, then if X has a 
realization larger than E[X], it is likely that Y will have a realization larger 
than E[Y], and the other way around. In this case we say that X and Y are 
positively correlated. In case the covariance is negative, the opposite effect oc- 
curs; X and Y are negatively correlated. In case Cov(X, Y) = 0 we say that X 
and Y are uncorrelated. An easy consequence of the linearity-of-expectations 
property (see Exercise 10.19) is the following rule. 


AN ALTERNATIVE EXPRESSION FOR THE COVARIANCE. Let X and 
Y be two random variables, then 


$$\mathrm{Cov}(X,Y) = \mathrm{E}[XY] - \mathrm{E}[X]\,\mathrm{E}[Y].$$


For X and Y from the example in Section 9.2, we have E[X] = 109/50, 
E[Y] = 157/100, and E[XY] = 171/50 (see Exercise 10.10). Thus we see that 
X and Y are negatively correlated: 

$$\mathrm{Cov}(X,Y) = \frac{171}{50} - \frac{109}{50}\cdot\frac{157}{100} = -\frac{13}{5000} < 0.$$

Moreover, this also illustrates that, in contrast to the expectation of the sum, 
for the expectation of the product, in general E[XY] is not equal to E[X] E[Y]. 
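The moments quoted here can be recomputed exactly, because the density of Section 9.2 is a polynomial on a rectangle. The sketch below is an illustration only: the helper `moment` (not from the text) integrates x^m y^n f(x, y) term by term, using the density f(x, y) = (2/75)(2x²y + xy²) on 0 ≤ x ≤ 3, 1 ≤ y ≤ 2.

```python
from fractions import Fraction

def moment(m, n):
    """E[X^m Y^n] for f(x, y) = (2/75)(2 x^2 y + x y^2) on 0 <= x <= 3, 1 <= y <= 2."""
    def ix(k):  # integral of x^k over [0, 3]
        return Fraction(3 ** (k + 1), k + 1)
    def iy(k):  # integral of y^k over [1, 2]
        return Fraction(2 ** (k + 1) - 1, k + 1)
    return Fraction(2, 75) * (2 * ix(m + 2) * iy(n + 1) + ix(m + 1) * iy(n + 2))

e_x, e_y, e_xy = moment(1, 0), moment(0, 1), moment(1, 1)
cov = e_xy - e_x * e_y              # alternative expression for the covariance
var_x = moment(2, 0) - e_x ** 2
var_y = moment(0, 2) - e_y ** 2
var_sum = var_x + var_y + 2 * cov   # variance of X + Y
print(cov, var_sum)
```

The small negative covariance accounts precisely for the gap between Var(X) + Var(Y) = 4747/10000 and Var(X + Y) = 939/2000.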




Independent versus uncorrelated 


Now let X and Y be two independent random variables. One expects that X 
and Y are uncorrelated: they have nothing to do with one another! This is 
indeed the case, for instance, if X and Y are discrete; one finds that 


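\[
\begin{aligned}
\mathrm{E}[XY] &= \sum_i \sum_j a_i b_j\, P(X = a_i, Y = b_j) \\
&= \sum_i \sum_j a_i b_j\, P(X = a_i)\, P(Y = b_j) \\
&= \Big(\sum_i a_i P(X = a_i)\Big)\Big(\sum_j b_j P(Y = b_j)\Big) \\
&= \mathrm{E}[X]\,\mathrm{E}[Y].
\end{aligned}
\]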
A similar reasoning holds in case X and Y are continuous random variables. 
The alternative expression for the covariance leads to the following important 
observation. 


INDEPENDENT VERSUS UNCORRELATED. If two random variables 
X and Y are independent, then X and Y are uncorrelated. 


Note that the reverse is not necessarily true. If X and Y are uncorrelated, 
they need not be independent. This is illustrated in the next quick exercise. 


QUICK EXERCISE 10.3 Consider the random variables X and Y with the joint 
distribution given in Table 10.1. Check that X and Y are dependent, but that 
also E[XY] = E[X]E[Y]. 
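
The gap between "uncorrelated" and "independent" can also be checked by machine. The sketch below does not use Table 10.1; it assumes a different, hypothetical example in which X takes the values −1, 0, 1 with equal probability and Y = X². The helper names (`expectation`, `covariance`, `marginal`, `independent`) are our own and only illustrate the alternative expression for the covariance.

```python
from fractions import Fraction

# Hypothetical joint distribution (not Table 10.1): X uniform on {-1, 0, 1}, Y = X^2.
third = Fraction(1, 3)
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}  # (a, b) -> P(X=a, Y=b)

def expectation(joint, g):
    """E[g(X, Y)] for a discrete joint pmf given as a dict."""
    return sum(g(a, b) * p for (a, b), p in joint.items())

def covariance(joint):
    """Cov(X, Y) via the alternative expression E[XY] - E[X]E[Y]."""
    e_x = expectation(joint, lambda a, b: a)
    e_y = expectation(joint, lambda a, b: b)
    e_xy = expectation(joint, lambda a, b: a * b)
    return e_xy - e_x * e_y

def marginal(joint, axis):
    """Marginal pmf of X (axis=0) or of Y (axis=1)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0) + p
    return out

def independent(joint):
    """True if P(X=a, Y=b) = P(X=a) P(Y=b) for all pairs (a, b)."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return all(joint.get((a, b), 0) == px[a] * py[b] for a in px for b in py)

print(covariance(joint))   # 0
print(independent(joint))  # False
```

Exact fractions are used so that a covariance of zero is not blurred by rounding.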

From the preceding we also deduce the following rule on the variance of the 
sum of two random variables. 


VARIANCE OF THE SUM. Let X and Y be two random variables. 
Then always 


Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y). 
If X and Y are uncorrelated, 


Var(X + Y) = Var(X) + Var(Y). 


Hence, we always have that E[X + Y] = E[X]+E[Y], whereas Var(X + Y) = 
Var(X)+ Var(Y) only holds for uncorrelated random variables (and hence for 
independent random variables!). 

As with the linearity-of-expectations rule, the rule for the variance of the 
sum of uncorrelated random variables holds more generally. For uncorrelated 
random variables X1, X2,...,Xn, we have 


\[
\operatorname{Var}(X_1 + X_2 + \cdots + X_n) = \operatorname{Var}(X_1) + \operatorname{Var}(X_2) + \cdots + \operatorname{Var}(X_n).
\]


This rule provides an easy way to compute the variance of a random variable 
with a Bin(n, p) distribution. Recall the representation for a Bin(n, p) random 
variable X: 

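\[
X = R_1 + R_2 + \cdots + R_n.
\]

Each $R_i$ has variance
\[
\operatorname{Var}(R_i) = \mathrm{E}[R_i^2] - (\mathrm{E}[R_i])^2 = 0^2\cdot(1-p) + 1^2\cdot p - (\mathrm{E}[R_i])^2 = p - p^2 = p(1-p).
\]
Using the independence of the $R_i$, the rule for the variance of the sum yields

\[
\operatorname{Var}(X) = \operatorname{Var}(R_1) + \operatorname{Var}(R_2) + \cdots + \operatorname{Var}(R_n) = np(1-p).
\]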
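
As a cross-check of np(1 − p), the variance can also be computed directly from the binomial probability mass function. This is a minimal sketch; the function name `binomial_variance` and the values n = 10, p = 3/10 are illustrative assumptions, not taken from the text.

```python
from fractions import Fraction
from math import comb

def binomial_variance(n, p):
    """Variance of a Bin(n, p) variable, computed as E[X^2] - (E[X])^2 from its pmf."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(k * q for k, q in enumerate(pmf))
    second_moment = sum(k * k * q for k, q in enumerate(pmf))
    return second_moment - mean**2

n, p = 10, Fraction(3, 10)      # illustrative values
print(binomial_variance(n, p))  # 21/10
print(n * p * (1 - p))          # 21/10
```

The direct computation needs all n + 1 probabilities, whereas the sum rule needs only the variance of a single Ber(p) variable.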


10.3 The correlation coefficient 


In the previous section we saw that the covariance between random vari- 
ables gives an indication of how they influence one another. A disadvan- 
tage of the covariance is the fact that it depends on the units in which the 
random variables are represented. For instance, suppose that the length in 
inches and weight in kilograms of Dutch citizens are modeled by random vari- 
ables L and W. Someone prefers to represent the length in centimeters. Since 
1 inch = 2.54 cm, one is dealing with a transformed random variable 2.54L.
The covariance between 2.54L and W is

\[
\operatorname{Cov}(2.54L, W) = \mathrm{E}[(2.54L)W] - \mathrm{E}[2.54L]\,\mathrm{E}[W]
= 2.54\left(\mathrm{E}[LW] - \mathrm{E}[L]\,\mathrm{E}[W]\right) = 2.54\,\operatorname{Cov}(L, W).
\]

That is, the covariance increases by a factor 2.54, which is somewhat disturbing
since changing from inches to centimeters does not essentially alter
the dependence between length and weight. This illustrates that the covariance
changes under a change of units. The following rule provides the exact
relationship.


COVARIANCE UNDER CHANGE OF UNITS. Let X and Y be two 
random variables. Then

\[
\operatorname{Cov}(rX + s, tY + u) = rt\,\operatorname{Cov}(X,Y)
\]
for all numbers r, s, t, and u.


See Exercise 10.14 for a derivation of this rule. 
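
The rule can be tried out on a small example. The sketch below assumes a hypothetical joint distribution on {0, 1} × {0, 1} (not one from the text) and uses the constants r = −2, s = 7, t = 5, u = −3 that appear in Quick exercise 10.4; the function name `cov` is our own.

```python
from fractions import Fraction

# Hypothetical joint pmf, chosen only for illustration: (a, b) -> P(X=a, Y=b).
joint = {
    (0, 0): Fraction(3, 10),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(2, 10),
    (1, 1): Fraction(4, 10),
}

def cov(joint, f=lambda a: a, g=lambda b: b):
    """Cov(f(X), g(Y)) from the definition E[(f(X) - E f(X)) (g(Y) - E g(Y))]."""
    e_f = sum(f(a) * p for (a, b), p in joint.items())
    e_g = sum(g(b) * p for (a, b), p in joint.items())
    return sum((f(a) - e_f) * (g(b) - e_g) * p for (a, b), p in joint.items())

r, s, t, u = -2, 7, 5, -3
lhs = cov(joint, lambda a: r * a + s, lambda b: t * b + u)  # Cov(rX + s, tY + u)
rhs = r * t * cov(joint)                                    # rt Cov(X, Y)
print(lhs, rhs)  # both equal -1
```

The shifts s and u drop out because they are subtracted again with the expectations; only the factors r and t survive.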




QUICK EXERCISE 10.4 For X and Y in the example in Section 9.2 (see also
Section 10.2), show that Cov(−2X + 7, 5Y − 3) = 13/500.


The preceding discussion indicates that the covariance Cov(X,Y) may not 
always be suitable to express the dependence between X and Y. For this 
reason there is a standardized version of the covariance called the correlation 
coefficient of X and Y. 


DEFINITION. Let X and Y be two random variables. The correlation
coefficient $\rho(X,Y)$ is defined to be 0 if Var(X) = 0 or Var(Y) = 0,
and otherwise
\[
\rho(X,Y) = \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}.
\]


Note that $\rho(X,Y)$ remains unaffected by a change of units, and therefore it
is dimensionless. For instance, if X and Y are measured in kilometers, then
Cov(X,Y), Var(X) and Var(Y) are in km², so that the dimension of $\rho(X,Y)$
is km²/(√km² · √km²).

For X and Y in the example in Section 9.2, recall that Cov(X, Y) = −13/5000.
We also have Var(X) = 989/2500 and Var(Y) = 791/10000 (see Exercise 10.10),
so that

\[
\rho(X,Y) = \frac{-\frac{13}{5000}}{\sqrt{\frac{989}{2500}\cdot\frac{791}{10000}}} = -0.0147.
\]


QUICK EXERCISE 10.5 For X and Y in the example in Section 9.2, show that
$\rho(-2X + 7, 5Y - 3) = 0.0147$.


The previous quick exercise illustrates the following linearity property for the
correlation coefficient. For numbers r, s, t, and u fixed, r, t ≠ 0, and random
variables X and Y:

\[
\rho(rX + s, tY + u) =
\begin{cases}
-\rho(X,Y) & \text{if } rt < 0, \\
\phantom{-}\rho(X,Y) & \text{if } rt > 0.
\end{cases}
\]


Thus we see that the size of the correlation coefficient is unaffected by a change 
of units, but note the possibility of a change of sign. 
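
A small numerical experiment makes the sign behavior visible. The sketch assumes a hypothetical joint distribution (not from the text); the function name `corr` and the chosen constants are illustrative only.

```python
from math import sqrt, isclose

# Hypothetical joint pmf, for illustration only: (a, b) -> P(X=a, Y=b).
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def corr(joint, f=lambda a: a, g=lambda b: b):
    """Correlation coefficient of f(X) and g(Y); defined as 0 if a variance is 0."""
    e_f = sum(f(a) * p for (a, b), p in joint.items())
    e_g = sum(g(b) * p for (a, b), p in joint.items())
    var_f = sum((f(a) - e_f) ** 2 * p for (a, b), p in joint.items())
    var_g = sum((g(b) - e_g) ** 2 * p for (a, b), p in joint.items())
    if var_f == 0 or var_g == 0:
        return 0.0
    c = sum((f(a) - e_f) * (g(b) - e_g) * p for (a, b), p in joint.items())
    return c / sqrt(var_f * var_g)

rho = corr(joint)
rho_flipped = corr(joint, lambda a: -2 * a + 7, lambda b: 5 * b - 3)  # rt < 0
rho_same = corr(joint, lambda a: 2.54 * a, lambda b: 5 * b - 3)       # rt > 0
print(isclose(rho_flipped, -rho), isclose(rho_same, rho))
```

Floating-point numbers are used here, so the comparison is made with `isclose` rather than with exact equality.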


Two random variables X and Y are “most correlated” if X = Y or if X = −Y.
As a matter of fact, in the former case $\rho(X,Y) = 1$, while in the latter case
$\rho(X,Y) = -1$. In general, for nonconstant random variables X and Y, the
following property holds:

\[
-1 \le \rho(X,Y) \le 1.
\]

For a formal derivation of this property, see the next remark.




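Remark 10.2 (Correlations are between −1 and 1). Here we give a
proof of the preceding formula. Since the variance of any random variable
is nonnegative, we have that

\[
\begin{aligned}
0 &\le \operatorname{Var}\left(\frac{X}{\sqrt{\operatorname{Var}(X)}} + \frac{Y}{\sqrt{\operatorname{Var}(Y)}}\right) \\
&= \operatorname{Var}\left(\frac{X}{\sqrt{\operatorname{Var}(X)}}\right) + \operatorname{Var}\left(\frac{Y}{\sqrt{\operatorname{Var}(Y)}}\right)
 + 2\operatorname{Cov}\left(\frac{X}{\sqrt{\operatorname{Var}(X)}}, \frac{Y}{\sqrt{\operatorname{Var}(Y)}}\right) \\
&= \frac{\operatorname{Var}(X)}{\operatorname{Var}(X)} + \frac{\operatorname{Var}(Y)}{\operatorname{Var}(Y)}
 + \frac{2\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}} = 2\,(1 + \rho(X,Y)).
\end{aligned}
\]

This implies $\rho(X,Y) \ge -1$. Using the same argument but replacing X by
−X shows that $\rho(X,Y) \le 1$.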


10.4 Solutions to the quick exercises 

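10.1 The expectation of X + Y is computed as follows:
\[
\begin{aligned}
\mathrm{E}[X+Y] ={}& (0+0)\cdot 0 + (1+0)\cdot\tfrac14 + (2+0)\cdot 0 \\
&+ (0+1)\cdot\tfrac14 + (1+1)\cdot 0 + (2+1)\cdot\tfrac14 \\
&+ (0+2)\cdot 0 + (1+2)\cdot\tfrac14 + (2+2)\cdot 0 = 2.
\end{aligned}
\]

10.2 First complete Table 10.1 as follows:

                a
   b        0     1     2    P(Y=b)

   0        0    1/4    0     1/4
   1       1/4    0    1/4    1/2
   2        0    1/4    0     1/4
 P(X=a)    1/4   1/2   1/4     1

It follows that E[X] = 0 · 1/4 + 1 · 1/2 + 2 · 1/4 = 1, and similarly E[Y] = 1.
Therefore E[X] + E[Y] = 2, which is equal to E[X + Y] as computed in
Quick exercise 10.1.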




10.3 From Table 10.1, as completed in Quick exercise 10.2, we see that X
and Y are dependent. For instance, P(X = 0, Y = 0) ≠ P(X = 0)P(Y = 0).
From Quick exercise 10.2 we know that E[X] = E[Y] = 1. Because we already
computed E[XY] = 1, it follows that E[XY] = E[X]E[Y]. According to the
alternative expression for the covariance this means that Cov(X,Y) = 0, i.e.,
X and Y are uncorrelated.


10.4 We already computed Cov(X, Y) = −13/5000 in Section 10.2. Hence, by
the rule on the covariance under change of units,
Cov(−2X + 7, 5Y − 3) = (−2) · 5 · (−13/5000) = 13/500.


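10.5 From Quick exercise 10.4 we have Cov(−2X + 7, 5Y − 3) = 13/500.
Since Var(X) = 989/2500 and Var(Y) = 791/10000, by definition of the
correlation coefficient and the rule for variances,

\[
\begin{aligned}
\rho(-2X+7, 5Y-3) &= \frac{\operatorname{Cov}(-2X+7, 5Y-3)}{\sqrt{\operatorname{Var}(-2X+7)\cdot\operatorname{Var}(5Y-3)}} \\
&= \frac{\frac{13}{500}}{\sqrt{4\operatorname{Var}(X)\cdot 25\operatorname{Var}(Y)}}
 = \frac{\frac{13}{500}}{\sqrt{\frac{3956}{2500}\cdot\frac{19775}{10000}}} = 0.0147.
\end{aligned}
\]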


10.5 Exercises 


10.1 ⊡ Consider the joint probability distribution of X and Y from Exercise 9.7,
obtained from data on hair color and eye color, for which we already
computed the expectations and variances of X and Y, as well as E[XY].


a. Compute Cov(X,Y). Are X and Y positively correlated, negatively
correlated, or uncorrelated?


b. Compute the correlation coefficient between X and Y. 


10.2 ⊡ Consider the two discrete random variables X and Y with joint distribution
derived in Exercise 9.2:

                a
   b        0     1     2    P(Y=b)

  −1       1/6   1/6   1/6    1/2
   1        0    1/2    0     1/2
 P(X=a)    1/6   2/3   1/6     1


a. Determine E[XY]. 
b. Note that X and Y are dependent. Show that X and Y are uncorrelated. 


c. Determine Var(X +Y). 
d. Determine Var(X — Y). 


10.3 Let U and V be the two random variables from Exercise 9.6. We have 
seen that U and V are dependent with joint probability distribution 


                a
   b        0     1     2    P(V=b)

   0       1/4    0    1/4    1/2
   1        0    1/2    0     1/2
 P(U=a)    1/4   1/2   1/4     1


Determine the covariance Cov(U, V) and the correlation coefficient $\rho(U,V)$.


10.4 Consider the joint probability distribution of the discrete random vari- 
ables X and Y from the Melencolia Exercise 9.1. Compute Cov(X,Y). 


                 a
   b       1        2        3        4

   1     16/136    3/136    2/136   13/136
   2      5/136   10/136   11/136    8/136
   3      9/136    6/136    7/136   12/136
   4      4/136   15/136   14/136    1/136


10.5 ⊡ Suppose X and Y are discrete random variables taking values 0, 1,
and 2. The following is given about the joint and marginal distributions:

                a
   b        0       1       2     P(Y=b)

   0       8/72    ...    10/72    1/3
   1      12/72    9/72    ...     1/2
   2       ...     3/72    ...     ...
 P(X=a)    1/3     ...     ...      1


a. Complete the table. 

b. Compute the expectation of X and of Y and the covariance between X 
and Y. 

c. Are X and Y independent? 




10.6 ⊞ Suppose X and Y are discrete random variables taking values c − 1, c,
and c + 1. The following is given about the joint and marginal distributions:

                  a
   b       c − 1     c     c + 1   P(Y=b)

  c − 1     2/45    9/45    4/45    1/3
    c       7/45    5/45    3/45    1/3
  c + 1     6/45    1/45    8/45    1/3
 P(X=a)     1/3     1/3     1/3      1


a. Take c = 0 and compute the expectation of X and of Y and the covariance 
between X and Y. 

b. Show that X and Y are uncorrelated, no matter what the value of c is. 
Hint: one could compute Cov(X,Y), but there is a short solution using 
the rule on the covariance under change of units (see page 141) together 
with part a. 

c. Are X and Y independent? 


10.7 ⊡ Consider the joint distribution of Quick exercise 9.2 and take ε fixed
between −1/4 and 1/4:

                  b
   a         0          1       p_X(a)

   0      1/4 − ε    1/4 + ε     1/2
   1      1/4 + ε    1/4 − ε     1/2
 p_Y(b)     1/2        1/2        1


a. Take ε = 1/8 and compute Cov(X,Y).
b. Take ε = 1/8 and compute $\rho(X,Y)$.
c. For which values of ε is $\rho(X,Y)$ equal to −1, 0, or 1?


10.8 Let X and Y be random variables such that 
E[X]=2, E[Y]=3, and Var(X) =4. 


a. Show that E[X²] = 8.
b. Determine the expectation of −2X² + Y.


10.9 ⊞ Suppose the blood of 1000 persons has to be tested to see which ones
are infected by a (rare) disease. Suppose that the probability that the test 




is positive is p = 0.001. The obvious way to proceed is to test each person, 
which results in a total of 1000 tests. An alternative procedure is the following. 
Distribute the blood of the 1000 persons over 25 groups of size 40, and mix 
half of the blood of each of the 40 persons with that of the others in each 
group. Now test the aggregated blood sample of each group: when the test is 
negative no one in that group has the disease; when the test is positive, at 
least one person in the group has the disease, and one will test the other half 
of the blood of all 40 persons of that group separately. In total, that gives 41 
tests for that group. Let $X_i$ be the total number of tests one has to perform
for the ith group using this alternative procedure.


a. Describe the probability distribution of $X_i$, i.e., list the possible values it
takes on and the corresponding probabilities.

b. What is the expected number of tests for the ith group? What is the 
expected total number of tests? What do you think of this alternative 
procedure for blood testing? 


10.10 Consider the variables X and Y from the example in Section 9.2
with joint probability density

\[
f(x,y) = \frac{2}{75}\left(2x^2y + xy^2\right) \quad\text{for } 0 \le x \le 3 \text{ and } 1 \le y \le 2
\]

and marginal probability densities

\[
\begin{aligned}
f_X(x) &= \frac{2}{225}\left(9x^2 + 7x\right) && \text{for } 0 \le x \le 3, \\
f_Y(y) &= \frac{1}{25}\left(3y^2 + 12y\right) && \text{for } 1 \le y \le 2.
\end{aligned}
\]

a. Compute E[X], E[Y], and E[X + Y].

b. Compute E[X²], E[Y²], E[XY], and E[(X + Y)²].

c. Compute Var(X + Y), Var(X), and Var(Y) and check that Var(X + Y) ≠
Var(X) + Var(Y).


10.11 Recall the relation between degrees Celsius and degrees Fahrenheit

\[
\text{degrees Fahrenheit} = \frac{9}{5}\cdot\text{degrees Celsius} + 32.
\]

Let X and Y be the average daily temperatures in degrees Celsius in Amsterdam
and Antwerp. Suppose that Cov(X,Y) = 3 and $\rho(X,Y) = 0.8$. Let T
and S be the same temperatures in degrees Fahrenheit. Compute Cov(T, S)
and $\rho(T, S)$.


10.12 Consider the independent random variables H and R from the vase
example, with a U(25,35) and a U(7.5,12.5) distribution. Compute E[H]
and E[R²] and check that E[V] = π E[H] E[R²].




10.13 Let X and Y be as in the triangle example in Exercise 9.15. Recall from 
Exercise 9.16 that X and Y represent the minimum and maximum coordinate 
of a point that is drawn from the unit square: X = min{U,V} and Y = 
max{U, V}. 


a. Show that E[X] = 1/3, Var(X) = 1/18, E[Y] = 2/3, and Var(Y) = 1/18. 
Hint: you might consult Exercise 8.15. 


b. Check that Var(X + Y) = 1/6, by using that U and V are independent
and that X + Y = U + V.


c. Determine the covariance Cov(X, Y) using the results from a and b. 


10.14 ⊞ Let X and Y be two random variables and let r, s, t, and u be
arbitrary real numbers.


a. Derive from the definition that Cov(X + s, Y + u) = Cov(X,Y).

b. Derive from the definition that Cov(rX, tY) = rt Cov(X,Y).

c. Combine parts a and b to show Cov(rX + s, tY + u) = rt Cov(X,Y).

10.15 In Figure 10.1 three plots are displayed. For each plot we carried out a 


simulation in which we generated 500 realizations of a pair of random variables 
(X,Y). We have chosen three different joint distributions of X and Y. 


Fig. 10.1. Some scatterplots. 


a. Indicate for each plot whether it corresponds to random variables X and 
Y that are positively correlated, negatively correlated, or uncorrelated. 

b. Which plot corresponds to random variables X and Y for which |p(X,Y)| 
is maximal? 


10.16 ⊡ Let X and Y be random variables.


a. Express Cov(X,X + Y) in terms of Var(X) and Cov(X,Y). 


b. Are X and X + Y positively correlated, uncorrelated, or negatively cor- 
related, or can anything happen? 




c. Same question as in part b, but now assume that X and Y are uncorre- 
lated. 


10.17 Extending the variance of the sum rule. For mathematical con- 
venience we first extend the sum rule to three random variables with zero 
expectation. Next we further extend the rule to three random variables with 
nonzero expectation. By the same line of reasoning we extend the rule to n 
random variables. 


a. Let X,Y and Z be random variables with expectation 0. Show that 


Var(X + Y + Z) = Var(X) + Var(Y) + Var(Z) 
+ 2Cov(X, Y) + 2Cov(X, Z) + 2Cov(Y, Z) . 


Hint: directly apply that for real numbers $y_1, \ldots, y_n$

\[
(y_1 + \cdots + y_n)^2 = y_1^2 + \cdots + y_n^2 + 2y_1y_2 + 2y_1y_3 + \cdots + 2y_{n-1}y_n.
\]


b. Now show a for X,Y, and Z with nonzero expectation. 
Hint: you might use the rules on pages 98 and 141 about variance and 
covariance under a change of units. 


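c. Derive a general variance of the sum rule, i.e., show that if $X_1, X_2, \ldots, X_n$
are random variables, then

\[
\begin{aligned}
\operatorname{Var}(X_1 + X_2 + \cdots + X_n)
={}& \operatorname{Var}(X_1) + \cdots + \operatorname{Var}(X_n) \\
&+ 2\operatorname{Cov}(X_1, X_2) + 2\operatorname{Cov}(X_1, X_3) + \cdots + 2\operatorname{Cov}(X_1, X_n) \\
&+ 2\operatorname{Cov}(X_2, X_3) + \cdots + 2\operatorname{Cov}(X_2, X_n) \\
&\;\;\vdots \\
&+ 2\operatorname{Cov}(X_{n-1}, X_n).
\end{aligned}
\]

d. Show that if the variances are all equal to $\sigma^2$ and the covariances are all
equal to some constant $\gamma$, then

\[
\operatorname{Var}(X_1 + X_2 + \cdots + X_n) = n\sigma^2 + n(n-1)\gamma.
\]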


10.18 ⊞ Consider a vase containing balls numbered 1, 2, ..., N. We draw
n balls without replacement from the vase. Each ball is selected with equal
probability, i.e., in the first draw each ball has probability 1/N, in the second
draw each of the N − 1 remaining balls has probability 1/(N − 1), and so
on. For i = 1, 2, ..., n, let $X_i$ denote the number on the ball in the ith draw.
From Exercise 9.18 we know that the variance of $X_i$ equals

\[
\operatorname{Var}(X_i) = \frac{1}{12}(N-1)(N+1).
\]




Show that

\[
\operatorname{Cov}(X_1, X_2) = -\frac{1}{12}(N+1).
\]

Before you do the exercise: why do you think the covariance is negative?
Hint: use $\operatorname{Var}(X_1 + X_2 + \cdots + X_N) = 0$ (why?), and apply Exercise 10.17.


10.19 Derive the alternative expression for the covariance: Cov(X,Y) =
E[XY] − E[X]E[Y].
Hint: work out (X − E[X])(Y − E[Y]) and use linearity of expectations.


10.20 Determine $\rho(U, U^2)$ when U has a U(0, a) distribution. Here a is a
positive number.


11 


More computations with more random 
variables 


Often one is interested in combining random variables, for instance, in taking 
the sum. In previous chapters, we have seen that it is fairly easy to describe 
the expected value and the variance of this new random variable. Often more 
details are needed, and one also would like to have its probability distribu- 
tion. In this chapter we consider the probability distributions of the sum, the 
product, and the quotient of two random variables. 


11.1 Sums of discrete random variables 


In a solo race across the Pacific Ocean, a ship has one spare radio set for 
communications. Each of the two radios has probability p of failing each time 
it is switched on. The skipper uses the radio once every day. Let X be the 
number of days the radio is switched on until it fails (so if the radio can be 
used for two days and fails on the third day, X attains the value 3). Similarly, 
let Y be the number of days the spare radio is switched on until it fails. Note 
that these random variables are similar to the one discussed in Section 4.4, 
which modeled the number of cycles until pregnancy. Hence, X and Y are 
Geo(p) distributed random variables. Suppose that p = 1/75 and that the
trip will last 100 days. Then at first sight the skipper does not need to worry
about radio contact: the number of days the first radio lasts is X − 1 days,
and similarly the spare radio lasts Y − 1 days. Therefore the expected number
of days he is able to have radio contact is

\[
\mathrm{E}[X - 1 + Y - 1] = \mathrm{E}[X] + \mathrm{E}[Y] - 2 = \frac{1}{p} + \frac{1}{p} - 2 = 148 \text{ days!}
\]

The skipper, who has some training in probability theory, still has some
concerns about the risk he runs with these two radios. What if the probability
P(X + Y − 2 ≤ 99) that his two radios break down before the end of the trip
is large?




This example illustrates that it is important to study the probability distribution
of the sum Z = X + Y of two discrete random variables. The random
variable Z takes on values $a_i + b_j$, where $a_i$ is a possible value of X and $b_j$
of Y. Hence, the probability mass function of Z is given by

\[
p_Z(c) = \sum_{(i,j)\,:\,a_i + b_j = c} P(X = a_i, Y = b_j),
\]

where the sum runs over all possible values $a_i$ of X and $b_j$ of Y such that
$a_i + b_j = c$. Because the sum only runs over values $a_i$ that are equal to $c - b_j$,
we simplify the summation and write

\[
p_Z(c) = \sum_j P(X = c - b_j, Y = b_j),
\]

where the sum runs over all possible values $b_j$ of Y. When X and Y are
independent, then $P(X = c - b_j, Y = b_j) = P(X = c - b_j)\,P(Y = b_j)$. This
leads to the following rule.


ADDING TWO INDEPENDENT DISCRETE RANDOM VARIABLES. Let X
and Y be two independent discrete random variables, with probability
mass functions $p_X$ and $p_Y$. Then the probability mass function
$p_Z$ of Z = X + Y satisfies

\[
p_Z(c) = \sum_j p_X(c - b_j)\, p_Y(b_j),
\]

where the sum runs over all possible values $b_j$ of Y.
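
The addition rule translates almost literally into code. The following sketch represents a probability mass function as a dictionary from values to probabilities; the function name `add_independent` and the example of two fair four-sided dice are our own assumptions, chosen so as not to give away the quick exercise that follows.

```python
from fractions import Fraction

def add_independent(px, py):
    """pmf of Z = X + Y for independent X and Y:
    p_Z(c) = sum over j of p_X(c - b_j) * p_Y(b_j)."""
    pz = {}
    for b, qb in py.items():      # run over all possible values b_j of Y
        for a, qa in px.items():  # only c = a + b_j has p_X(c - b_j) > 0
            pz[a + b] = pz.get(a + b, 0) + qa * qb
    return pz

# Hypothetical example: two independent fair four-sided dice.
die = {k: Fraction(1, 4) for k in range(1, 5)}
total = add_independent(die, die)
print(total[5])             # 1/4
print(sum(total.values()))  # 1
```

Looping over the values of X instead of over all candidate sums c avoids terms that are zero anyway, which is the same simplification made in the derivation of the rule.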


QUICK EXERCISE 11.1 Let S be the sum of two independent throws with 
a die, so S = X + Y, where X and Y are independent, and P(X =k) = 
P(Y =k) =1/6, for k = 1,...,6. Use the addition rule to compute P(S = 3) 
and P(S = 8), and compare your answers with Table 9.2. 

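In the solo race example, X and Y are independent Geo(p) distributed random
variables. Let Z = X + Y; then by the above rule for k ≥ 2

\[
P(X + Y = k) = p_Z(k) = \sum_{\ell=1}^{\infty} p_X(k - \ell)\, p_Y(\ell).
\]

Because $p_X(a) = 0$ for a ≤ 0, all terms in this sum with ℓ ≥ k vanish, hence

\[
\begin{aligned}
P(X + Y = k) &= \sum_{\ell=1}^{k-1} p_X(k - \ell)\cdot p_Y(\ell)
 = \sum_{\ell=1}^{k-1} (1-p)^{k-\ell-1}p\cdot(1-p)^{\ell-1}p \\
&= \sum_{\ell=1}^{k-1} p^2(1-p)^{k-2} = (k-1)\,p^2(1-p)^{k-2}.
\end{aligned}
\]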


Note that X + Y does not have a geometric distribution. 




Remark 11.1 (The expected value of a geometric distribution). 
The preceding gives us the opportunity to calculate the expected value of 
the geometric distribution in an easy way. Since the probabilities of Z add 
up to one: 


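\[
1 = \sum_{k=2}^{\infty} p_Z(k) = \sum_{k=2}^{\infty} (k-1)\,p^2(1-p)^{k-2}
 = p \sum_{\ell=1}^{\infty} \ell\, p(1-p)^{\ell-1},
\]

it follows that
\[
\mathrm{E}[X] = \sum_{\ell=1}^{\infty} \ell\, p(1-p)^{\ell-1} = \frac{1}{p}.
\]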
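Returning to the solo race example, it is clear that the skipper does have
grounds to worry:

\[
\begin{aligned}
P(X + Y - 2 \le 99) &= P(X + Y \le 101) = \sum_{k=2}^{101} P(X + Y = k) \\
&= \sum_{k=2}^{101} (k-1)\left(\tfrac{1}{75}\right)^2\left(1 - \tfrac{1}{75}\right)^{k-2} = 0.3904.
\end{aligned}
\]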
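
The sum is easy to evaluate numerically. The sketch below simply adds the 100 terms; the function name `prob_radios_fail` is our own, and the closed form used for comparison is the complement of "at most one failure in the first 101 switch-ons", which is an observation added here rather than one made in the text.

```python
def prob_radios_fail(p, last_day):
    """P(X + Y <= last_day), using P(X + Y = k) = (k - 1) p^2 (1 - p)^(k - 2) for k >= 2."""
    return sum((k - 1) * p**2 * (1 - p) ** (k - 2) for k in range(2, last_day + 1))

p = 1 / 75
direct = prob_radios_fail(p, 101)

# Cross-check: X + Y <= 101 means at least two failures among 101 switch-ons.
closed_form = 1 - (1 - p) ** 101 - 101 * p * (1 - p) ** 100

print(round(direct, 4), round(closed_form, 4))  # both about 0.3904
```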


The sum of two binomial random variables 


It is not always necessary to use the addition rule for two independent discrete 
random variables to find the distribution of their sum. For example, let X and 
Y be two independent random variables, where X has a Bin(n, p) distribution 
and Y has a Bin(m,p) distribution. Since a Bin(n,p) distribution models 
the number of successes in n independent trials with success probability p, 
heuristically, X + Y represents the number of successes in n + m trials with 
success probability p and should therefore have a Bin(n + m, p) distribution. 


A more formal reasoning is the following. Let

\[
R_1, R_2, \ldots, R_n, S_1, S_2, \ldots, S_m
\]

be independent Ber(p) distributed random variables. Recall that a Bin(n, p)
distributed random variable has the same distribution as the sum of n independent
Ber(p) distributed random variables (see Section 4.3 or 10.2). Hence
X has the same distribution as $R_1 + R_2 + \cdots + R_n$ and Y has the same
distribution as $S_1 + S_2 + \cdots + S_m$. This means that X + Y has the same
distribution as the sum of n + m independent Ber(p) variables and therefore has
a Bin(n + m, p) distribution. This can also be verified analytically by means
of the addition rule, using that X and Y are also independent.
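
For specific parameters the analytic verification can be delegated to a few lines of code. In this sketch the parameter values n = 3, m = 4, p = 1/3 and the function names are illustrative assumptions; exact fractions make the comparison of the two probability mass functions an exact equality.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """pmf of Bin(n, p) as a dict k -> P(X = k)."""
    return {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

def add_independent(px, py):
    """pmf of the sum of two independent discrete random variables (addition rule)."""
    pz = {}
    for a, qa in px.items():
        for b, qb in py.items():
            pz[a + b] = pz.get(a + b, 0) + qa * qb
    return pz

n, m, p = 3, 4, Fraction(1, 3)  # illustrative parameters
same = add_independent(binomial_pmf(n, p), binomial_pmf(m, p)) == binomial_pmf(n + m, p)
print(same)  # True
```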


QUICK EXERCISE 11.2 For i = 1, 2, 3, let $X_i$ be a Bin($n_i$, p) distributed random
variable, and suppose that $X_1$, $X_2$, and $X_3$ are independent. Argue that
$Z = X_1 + X_2 + X_3$ is a Bin($n_1 + n_2 + n_3$, p) distributed random variable.




11.2 Sums of continuous random variables 


Let X and Y be two continuous random variables. What can we say about the 
probability density function of Z = X+Y? We start with an example. Suppose 
that X and Y are two independent, U(0,1) distributed random variables. One 
might be tempted to think that Z is also uniformly distributed. 

Note that the joint probability density function f of X and Y is equal to the
product of the marginal probability density functions $f_X$ and $f_Y$:

\[
f(x,y) = f_X(x)\,f_Y(y) = 1 \quad\text{for } 0 \le x \le 1 \text{ and } 0 \le y \le 1,
\]

and f(x, y) = 0 otherwise. Let us compute the distribution function $F_Z$ of Z.
It is easy to see that $F_Z(a) = 0$ for a < 0 and $F_Z(a) = 1$ for a ≥ 2. For a
between 0 and 1, let G be that part of the plane below the line x + y = a, and
let A be the triangle with vertices (0,0), (a,0), and (0,a); see Figure 11.1.




Fig. 11.1. The region G in the plane where x+y < a (with 0 < a < 1) intersected 
with A. 


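Since f(x,y) = 0 outside [0,1] × [0,1], the distribution function of Z is given
by

\[
\begin{aligned}
F_Z(a) &= P(Z \le a) = P(X + Y \le a) \\
&= \iint_G f(x,y)\,dx\,dy = \iint_A 1\,dx\,dy = \text{area of } A = \frac{1}{2}a^2
\end{aligned}
\]

for 0 ≤ a < 1. For the case where 1 ≤ a < 2 one can draw a similar figure (see
Figure 11.2), from which one can find that

\[
F_Z(a) = 1 - \frac{1}{2}(2 - a)^2 \quad\text{for } 1 \le a < 2.
\]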




Fig. 11.2. The region G in the plane where x+y < a (with 1 < a < 2) intersected 
with A. 


We see that Z is not uniformly distributed. 
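
The two branches of the distribution function can be checked against a crude area computation. The sketch below counts the midpoints of a fine grid on the unit square that fall below the line x + y = a; the function names and the grid size m = 400 are arbitrary choices for illustration.

```python
def cdf_sum_uniform(a):
    """F_Z(a) for Z = X + Y with X and Y independent U(0, 1), as derived above."""
    if a < 0:
        return 0.0
    if a < 1:
        return 0.5 * a**2
    if a < 2:
        return 1 - 0.5 * (2 - a) ** 2
    return 1.0

def cdf_by_grid(a, m=400):
    """Approximate P(X + Y <= a) by the fraction of midpoints of an m-by-m grid
    on the unit square that satisfy x + y <= a."""
    h = 1.0 / m
    hits = sum(1 for i in range(m) for j in range(m)
               if (i + 0.5) * h + (j + 0.5) * h <= a)
    return hits / (m * m)

for a in (0.5, 1.0, 1.5):
    print(a, cdf_sum_uniform(a), round(cdf_by_grid(a), 3))
```

A uniform distribution on [0, 2] would give the value a/2 at each a, so the value near 0.125 at a = 0.5 already shows that Z is not uniform.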


In general, the distribution function $F_Z$ of the sum Z of two continuous random
variables X and Y is given by

\[
F_Z(a) = P(Z \le a) = P(X + Y \le a) = \iint_{(x,y)\,:\,x+y \le a} f(x,y)\,dx\,dy.
\]

The double integral on the right-hand side can be written as a repeated integral,
first over x and then over y. Note that x and y are between minus
and plus infinity and that they also have to satisfy x + y ≤ a or, equivalently,
x ≤ a − y. This means that the integral over x runs from minus infinity to
a − y, and the integral over y runs from minus infinity to plus infinity. Hence

\[
F_Z(a) = \int_{-\infty}^{\infty}\left(\int_{-\infty}^{a-y} f(x,y)\,dx\right)dy.
\]

In case X and Y are independent, the last double integral can be written as

\[
\int_{-\infty}^{\infty}\left(\int_{-\infty}^{a-y} f_X(x)\,dx\right) f_Y(y)\,dy,
\]

and we find that

\[
F_Z(a) = \int_{-\infty}^{\infty} F_X(a - y)\, f_Y(y)\,dy
\]

for −∞ < a < ∞. Differentiating $F_Z$ we find the following rule.




ADDING TWO INDEPENDENT CONTINUOUS RANDOM VARIABLES. 
Let X and Y be two independent continuous random variables, with 
probability density functions fx and fy. Then the probability den- 
sity function fz of Z = X + Y is given by 


\[
f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y)\, f_Y(y)\,dy
\]

for −∞ < z < ∞.


The single-server queue revisited 


In the single-server queue model from Section 6.4, $T_1$ is the time between
the start at time zero and the arrival of the first customer and $T_i$ is the
time between the arrival of the (i − 1)th and ith customer at a well. We are
interested in the arrival time of the nth customer at the well. For n ≥ 1, let
$Z_n$ be the arrival time of the nth customer at the well: $Z_n = T_1 + \cdots + T_n$.
Since each $T_i$ has an Exp(0.5) distribution, it follows from the linearity-of-expectations
rule in Section 10.1 that the expected arrival time of the nth
customer is

\[
\mathrm{E}[Z_n] = \mathrm{E}[T_1 + \cdots + T_n] = \mathrm{E}[T_1] + \cdots + \mathrm{E}[T_n] = 2n \text{ minutes}.
\]

We would like to know whether the pump capacity is sufficient; for instance,
when the service times $S_i$ are independent U(2,5) distributed random variables
(this is the case when the pump capacity v = 1). In that case, at most
30 customers can pump water at the well in the first hour. If $P(Z_{30} \le 60)$ is
large, one might be tempted to increase the capacity of the well.


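Recalling that the $T_i$ are independent Exp(λ) random variables, it follows
from the addition rule that $f_{T_1+T_2}(z) = 0$ if z < 0, and for z ≥ 0 that

\[
\begin{aligned}
f_{Z_2}(z) = f_{T_1+T_2}(z) &= \int_{-\infty}^{\infty} f_{T_1}(z - y)\, f_{T_2}(y)\,dy \\
&= \int_0^z \lambda e^{-\lambda(z-y)}\cdot\lambda e^{-\lambda y}\,dy \\
&= \lambda^2 e^{-\lambda z}\int_0^z dy = \lambda^2 z e^{-\lambda z}.
\end{aligned}
\]

Viewing $T_1 + T_2 + T_3$ as the sum of $T_1$ and $T_2 + T_3$, we find, by applying the
addition rule again, that $f_{Z_3}(z) = 0$ if z < 0, and for z ≥ 0 that

\[
\begin{aligned}
f_{Z_3}(z) = f_{T_1+T_2+T_3}(z) &= \int_{-\infty}^{\infty} f_{T_1}(z - y)\, f_{T_2+T_3}(y)\,dy \\
&= \int_0^z \lambda e^{-\lambda(z-y)}\cdot\lambda^2 y e^{-\lambda y}\,dy \\
&= \lambda^3 e^{-\lambda z}\int_0^z y\,dy = \frac{1}{2}\lambda^3 z^2 e^{-\lambda z}.
\end{aligned}
\]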

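Repeating this procedure, we find that $f_{Z_n}(z) = 0$ if z < 0, and

\[
f_{Z_n}(z) = \frac{\lambda(\lambda z)^{n-1} e^{-\lambda z}}{(n-1)!}
\]

for z ≥ 0. Using integration by parts we find (see Exercise 11.13) that for
n ≥ 1 and a ≥ 0:

\[
P(Z_n \le a) = 1 - e^{-\lambda a}\sum_{i=0}^{n-1}\frac{(\lambda a)^i}{i!}.
\]

Since λ = 1/2, it follows that
\[
P(Z_{30} \le 60) = 0.524.
\]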


Even if each customer fills his jerrican in the minimum time of 2 minutes, we 
see that after an hour with probability 0.524, people will be waiting at the 
pump! 
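
The value 0.524 follows from the formula for $P(Z_n \le a)$ with n = 30, λ = 1/2, and a = 60. A minimal sketch of the computation, with a function name (`erlang_cdf`) of our own choosing:

```python
from math import exp, factorial

def erlang_cdf(n, lam, a):
    """P(Z_n <= a) = 1 - exp(-lam * a) * sum_{i=0}^{n-1} (lam * a)^i / i!  for a >= 0."""
    return 1 - exp(-lam * a) * sum((lam * a) ** i / factorial(i) for i in range(n))

print(round(erlang_cdf(30, 0.5, 60), 3))  # about 0.524
```

For n = 1 the formula reduces to 1 − exp(−λa), the distribution function of a single Exp(λ) interarrival time.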

The random variable $Z_n$ is an example of a gamma random variable, defined
as follows.


DEFINITION. A continuous random variable X has a gamma distribution
with parameters α > 0 and λ > 0 if its probability density
function f is given by f(x) = 0 for x < 0 and

\[
f(x) = \frac{\lambda(\lambda x)^{\alpha-1} e^{-\lambda x}}{\Gamma(\alpha)} \quad\text{for } x \ge 0,
\]

where the quantity Γ(α) is a normalizing constant such that f integrates
to 1. We denote this distribution by Gam(α, λ).


The quantity Γ(α) is for α > 0 defined by

\[
\Gamma(\alpha) = \int_0^{\infty} t^{\alpha-1} e^{-t}\,dt.
\]
It satisfies for α > 0 and n = 1, 2, ...
\[
\Gamma(\alpha + 1) = \alpha\,\Gamma(\alpha) \quad\text{and}\quad \Gamma(n) = (n-1)!
\]

(see also Exercise 11.12). It follows from our example that the sum of n independent
Exp(λ) distributed random variables has a Gam(n, λ) distribution,
also known as the Erlang-n distribution with parameter λ.
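
Both properties of Γ, and the link between the gamma density and the Erlang densities found above, can be verified numerically with the gamma function from Python's standard library. The function name `gamma_pdf` and the test values are our own; the density is evaluated only for x > 0 here, to stay clear of the boundary point x = 0 when α < 1.

```python
from math import exp, factorial, gamma, isclose

def gamma_pdf(x, alpha, lam):
    """Density of Gam(alpha, lam) at x; intended for x > 0 (returns 0 for x < 0)."""
    if x < 0:
        return 0.0
    return lam * (lam * x) ** (alpha - 1) * exp(-lam * x) / gamma(alpha)

# Gamma(n) = (n - 1)! for n = 1, 2, ..., 10
print(all(isclose(gamma(n), factorial(n - 1)) for n in range(1, 11)))  # True

# Gamma(alpha + 1) = alpha * Gamma(alpha) for a non-integer alpha
alpha = 2.5
print(isclose(gamma(alpha + 1), alpha * gamma(alpha)))  # True

# Gam(2, lam) has the density lam^2 * z * exp(-lam * z) of T_1 + T_2
lam, z = 0.5, 3.0
print(isclose(gamma_pdf(z, 2, lam), lam**2 * z * exp(-lam * z)))  # True
```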


The sum of independent normal random variables 


Using the addition rule you can show that the sum of two independent nor- 
mally distributed random variables is again a normally distributed random 




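variable. For instance, if X and Y are independent N(0, 1) distributed random
variables, one has

\[
\begin{aligned}
f_{X+Y}(z) &= \int_{-\infty}^{\infty} f_X(z - y)\, f_Y(y)\,dy \\
&= \int_{-\infty}^{\infty} \left(\frac{1}{\sqrt{2\pi}}\right) e^{-\frac12 (z-y)^2}\left(\frac{1}{\sqrt{2\pi}}\right) e^{-\frac12 y^2}\,dy \\
&= \int_{-\infty}^{\infty} \left(\frac{1}{\sqrt{2\pi}}\right)^2 e^{-\frac12 (2y^2 - 2yz + z^2)}\,dy.
\end{aligned}
\]

To prepare a change of variables, we subtract the term $\frac12 z^2$ from $2y^2 - 2yz + z^2$
to complete the square in the exponent:

\[
2y^2 - 2yz + \tfrac12 z^2 = \left[\sqrt{2}\left(y - \tfrac{z}{2}\right)\right]^2.
\]

In this way we find with changing integration variables $t = \sqrt{2}\,(y - z/2)$:

\[
\begin{aligned}
f_{X+Y}(z) &= \frac{1}{\sqrt{2\pi}}\, e^{-\frac14 z^2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac12 (2y^2 - 2yz + \frac12 z^2)}\,dy \\
&= \frac{1}{\sqrt{2\pi}}\, e^{-\frac14 z^2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac12 \left[\sqrt{2}(y - z/2)\right]^2}\,dy \\
&= \frac{1}{\sqrt{2\pi}}\, e^{-\frac14 z^2}\, \frac{1}{\sqrt{2}} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac12 t^2}\,dt
 = \frac{1}{\sqrt{4\pi}}\, e^{-\frac14 z^2} \int_{-\infty}^{\infty} \phi(t)\,dt.
\end{aligned}
\]

Since φ is the probability density of the standard normal distribution, it integrates
to 1, so that

\[
f_{X+Y}(z) = \frac{1}{\sqrt{4\pi}}\, e^{-\frac14 z^2},
\]

which is the probability density of the N(0,2) distribution. Thus, X + Y also
has a normal distribution. This is more generally true.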


THE SUM OF INDEPENDENT NORMAL RANDOM VARIABLES. If X and
Y are independent random variables with a normal distribution, then
X + Y also has a normal distribution.
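
The N(0, 2) result can be confirmed by evaluating the convolution integral numerically. The sketch uses a plain midpoint rule on the interval [−12, 12]; the function names, the interval, and the number of steps are arbitrary choices made for this illustration.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, var=1.0):
    """Density of the N(mu, var) distribution (second parameter is the variance)."""
    return exp(-((x - mu) ** 2) / (2 * var)) / sqrt(2 * pi * var)

def convolve_at(z, f, g, lo=-12.0, hi=12.0, steps=4800):
    """Midpoint-rule approximation of the integral of f(z - y) g(y) dy over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(z - (lo + (k + 0.5) * h)) * g(lo + (k + 0.5) * h)
                   for k in range(steps))

for z in (0.0, 1.0, 2.5):
    approx = convolve_at(z, normal_pdf, normal_pdf)  # addition rule for N(0,1) + N(0,1)
    exact = normal_pdf(z, 0.0, 2.0)                  # density of N(0, 2)
    print(z, round(approx, 6), round(exact, 6))
```

The two printed columns agree to the displayed precision, since the integrand is smooth and negligible outside the chosen interval.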


QUICK EXERCISE 11.3 Let X and Y be independent random variables, where 
X has an N(3,16) distribution, and Y an N(5,9) distribution. Then X + Y 
is a normally distributed random variable. What are its parameters? 


Rather surprisingly, independence of X and Y is not a prerequisite, as can be 
seen in the following remark. 




Remark 11.2 (Sums of dependent normal random variables). We
say the pair X, Y has a bivariate normal distribution if their joint probability
density equals

\[
\frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}}\,\exp\left(-\frac{1}{2}\,\frac{1}{(1-\rho^2)}\,Q(x,y)\right),
\]

where

\[
Q(x,y) = \left(\frac{x-\mu_X}{\sigma_X}\right)^2 - 2\rho\left(\frac{x-\mu_X}{\sigma_X}\right)\left(\frac{y-\mu_Y}{\sigma_Y}\right) + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2.
\]

Here $\mu_X$ and $\mu_Y$ are the expectations of X and Y, $\sigma_X^2$ and $\sigma_Y^2$ are their
variances, and ρ is the correlation coefficient of X and Y. If X and Y have
such a bivariate normal distribution, then X has an $N(\mu_X, \sigma_X^2)$ and Y has
an $N(\mu_Y, \sigma_Y^2)$ distribution. Moreover, one can show that X + Y has an
$N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2 + 2\rho\sigma_X\sigma_Y)$ distribution. An example of a bivariate
normal probability density is displayed in Figure 9.2. This probability density
corresponds to parameters $\mu_X = \mu_Y = 0$, $\sigma_X = \sigma_Y = 1/6$, and ρ = 0.8.


11.3 Product and quotient of two random variables 


Recall from Chapter 7 the example of the architect who wants maximal vari- 
ety in the sizes of buildings. The architect wants more variety and therefore 
replaces the square buildings by rectangular buildings: the buildings should 
be of width X and depth Y, where X and Y are independent and uniformly 
distributed between 0 and 10 meters. Since X and Y are independent, the 
expected area of a building equals E[XY] = E[X] E[Y] = 5 · 5 = 25 m². But
what can one say about the distribution of the area Z = XY of an arbitrary 
building? 

Let us calculate the distribution function of Z. Clearly $F_Z(a) = 0$ if a < 0
and $F_Z(a) = 1$ if a > 100. For a between 0 and 100 we can compute $F_Z(a)$
with the help of Figure 11.3. We find

\[
\begin{aligned}
F_Z(a) &= P(Z \le a) = P(XY \le a) \\
&= \frac{\text{area of the shaded region in Figure 11.3}}{\text{area of } [0,10]\times[0,10]} \\
&= \frac{1}{100}\left(\frac{a}{10}\cdot 10 + \int_{a/10}^{10} \frac{a}{x}\,dx\right) \\
&= \frac{1}{100}\left(a + \Big[a\ln x\Big]_{a/10}^{10}\right) = \frac{a\,(1 + 2\ln 10 - \ln a)}{100}.
\end{aligned}
\]

Hence the probability density function $f_Z$ of Z is given by




Fig. 11.3. The region G in the plane where xy ≤ a intersected with [0, 10] × [0, 10].


\[
f_Z(z) = \frac{d}{dz}\,\frac{z\,(1 + 2\ln 10 - \ln z)}{100} = \frac{\ln 100 - \ln z}{100}
\]

for 0 < z < 100 m².
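
Two quick numerical checks of this result: the derivative of $F_Z$ should match $f_Z$, and $f_Z$ should integrate to 1 over (0, 100). The function names `cdf_area` and `pdf_area`, the evaluation point z = 30, and the step sizes are assumptions made for this sketch.

```python
from math import log

def cdf_area(a):
    """F_Z(a) = a (1 + 2 ln 10 - ln a) / 100 for 0 < a <= 100."""
    return a * (1 + 2 * log(10) - log(a)) / 100

def pdf_area(z):
    """f_Z(z) = (ln 100 - ln z) / 100 for 0 < z < 100."""
    return (log(100) - log(z)) / 100

# Central difference of F_Z compared with f_Z at an arbitrary point.
z, h = 30.0, 1e-5
derivative_ok = abs((cdf_area(z + h) - cdf_area(z - h)) / (2 * h) - pdf_area(z)) < 1e-8
print(derivative_ok)

# Midpoint-rule integral of f_Z over (0, 100); should be close to 1.
steps = 100_000
width = 100 / steps
integral = sum(pdf_area((k + 0.5) * width) for k in range(steps)) * width
print(round(integral, 3))
```

The density is unbounded near z = 0, which reflects that very small areas are comparatively likely; the midpoint rule avoids evaluating the logarithm at 0.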

This computation can be generalized to arbitrary independent continuous 


random variables, and we obtain the following formula for the probability 
density function of the product of two random variables. 


PRODUCT OF INDEPENDENT CONTINUOUS RANDOM VARIABLES. Let 
X and Y be two independent continuous random variables with prob- 
ability densities fx and fy. Then the probability density function 
fz of Z = XY is given by 


\[
f_Z(z) = \int_{-\infty}^{\infty} f_Y\!\left(\frac{z}{x}\right) f_X(x)\,\frac{1}{|x|}\,dx
\]

for −∞ < z < ∞.


For the quotient Z = X/Y of two independent random variables X and 
Y it is now fairly easy to derive the probability density function. Since the 
independence of X and Y implies that X and 1/Y are independent, the 
preceding rule yields 


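\[
f_Z(z) = \int_{-\infty}^{\infty} f_{1/Y}\!\left(\frac{z}{x}\right) f_X(x)\,\frac{1}{|x|}\,dx.
\]

Recall from Section 8.2 that the probability density function of 1/Y is given
by

\[
f_{1/Y}(y) = \frac{1}{y^2}\, f_Y\!\left(\frac{1}{y}\right).
\]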




Substituting this in the integral, after changing the variable of integration, we 
find the following rule. 


QUOTIENT OF INDEPENDENT CONTINUOUS RANDOM VARIABLES.
Let $X$ and $Y$ be two independent continuous random variables with
probability densities $f_X$ and $f_Y$. Then the probability density
function $f_Z$ of $Z = X/Y$ is given by

\[
f_Z(z) = \int_{-\infty}^{\infty} f_X(zx)\, f_Y(x)\, |x|\,dx
\]

for $-\infty < z < \infty$.


The quotient of two independent normal random variables 


Let X and Y be independent random variables, both having a standard normal 
distribution. When we compute the quotient Z of X and Y, we find a so-called 
standard Cauchy distribution: 


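\[
\begin{aligned}
f_Z(z) &= \int_{-\infty}^{\infty} |x|\,\frac{1}{\sqrt{2\pi}}\,e^{-z^2x^2/2}\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx
        = \frac{1}{2\pi}\cdot 2\int_{0}^{\infty} x\,e^{-(z^2+1)x^2/2}\,dx \\
       &= \frac{1}{\pi}\left[\frac{-1}{z^2+1}\,e^{-(z^2+1)x^2/2}\right]_{0}^{\infty}
        = \frac{1}{\pi(z^2+1)}.
\end{aligned}
\]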


This is the special case $\alpha = 0$, $\beta = 1$ of the following family of distributions.


DEFINITION. A continuous random variable has a Cauchy distribution
with parameters $\alpha$ and $\beta > 0$ if its probability density function $f$
is given by

\[
f(x) = \frac{\beta}{\pi\left(\beta^2 + (x - \alpha)^2\right)} \quad \text{for } -\infty < x < \infty.
\]

We denote this distribution by $\mathrm{Cau}(\alpha, \beta)$.


By integrating, we find that the distribution function $F$ of a Cauchy
distribution is given by

\[
F(x) = \frac{1}{2} + \frac{1}{\pi}\arctan\!\left(\frac{x - \alpha}{\beta}\right).
\]


The parameter $\alpha$ is the point of symmetry of the probability density
function $f$. Note that $\alpha$ is not the expected value of $Z$. As a matter of fact, it was
shown in Remark 7.1 that the expected value does not exist! The probability
density $f$ is shown together with the distribution function $F$ for the case
$\alpha = 2$, $\beta = 5$ in Figure 11.4.
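As a numerical illustration, one can simulate the quotient of two independent standard normal variables and compare the empirical distribution function with the Cauchy distribution function $F$. This is only a sketch: the helper names, the sample size, and the seed are assumptions made for the example.

```python
import math
import random


def cauchy_cdf(x: float, alpha: float = 0.0, beta: float = 1.0) -> float:
    """Distribution function of Cau(alpha, beta)."""
    return 0.5 + math.atan((x - alpha) / beta) / math.pi


def quotient_samples(n: int, seed: int = 2) -> list:
    """Draw n realizations of X/Y with X, Y independent standard normal."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        x, y = rng.gauss(0, 1), rng.gauss(0, 1)
        if y != 0.0:  # guard against division by zero (a probability-zero event)
            out.append(x / y)
    return out


samples = quotient_samples(100_000)
for x in (-2.0, 0.0, 1.0, 5.0):
    empirical = sum(z <= x for z in samples) / len(samples)
    print(x, round(empirical, 3), round(cauchy_cdf(x), 3))
```

Averaging the samples, by contrast, does not settle down as the sample size grows, which is consistent with the expectation not existing.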




Fig. 11.4. The graphs of f and F of the Cau(2,5) distribution.


QUICK EXERCISE 11.4 Argue—without doing any calculations—that if Z has 
a standard Cauchy distribution, 1/Z also has a standard Cauchy distribution. 


11.4 Solutions to the quick exercises 


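11.1 Using the addition rule we find

\[
\begin{aligned}
\mathrm{P}(S = 3) &= \sum_{j=1}^{6} p_X(3 - j)\,p_Y(j) \\
  &= p_X(2)p_Y(1) + p_X(1)p_Y(2) + p_X(0)p_Y(3) \\
  &\quad + p_X(-1)p_Y(4) + p_X(-2)p_Y(5) + p_X(-3)p_Y(6) \\
  &= \tfrac{1}{36} + \tfrac{1}{36} + 0 + 0 + 0 + 0 = \tfrac{1}{18},
\end{aligned}
\]

and

\[
\begin{aligned}
\mathrm{P}(S = 8) &= \sum_{j=1}^{6} p_X(8 - j)\,p_Y(j) \\
  &= p_X(7)p_Y(1) + p_X(6)p_Y(2) + p_X(5)p_Y(3) \\
  &\quad + p_X(4)p_Y(4) + p_X(3)p_Y(5) + p_X(2)p_Y(6) \\
  &= 0 + \tfrac{1}{36} + \tfrac{1}{36} + \tfrac{1}{36} + \tfrac{1}{36} + \tfrac{1}{36} = \tfrac{5}{36}.
\end{aligned}
\]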
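The addition rule used in this solution is a discrete convolution, which is easy to carry out exactly with rational arithmetic. The following is a minimal sketch for two fair dice; the function name `add_pmfs` and the dictionary representation of a probability mass function are choices made for this example.

```python
from fractions import Fraction


def add_pmfs(p_x: dict, p_y: dict) -> dict:
    """Probability mass function of X + Y for independent X and Y.

    Implements p_Z(k) = sum_j p_X(k - j) p_Y(j) by looping over all pairs.
    """
    p_z = {}
    for i, prob_i in p_x.items():
        for j, prob_j in p_y.items():
            p_z[i + j] = p_z.get(i + j, Fraction(0)) + prob_i * prob_j
    return p_z


die = {k: Fraction(1, 6) for k in range(1, 7)}
p_s = add_pmfs(die, die)
print(p_s[3], p_s[8])  # prints: 1/18 5/36
```

Values outside the range of a die simply do not occur as keys, which plays the role of the zero terms such as $p_X(0)$ and $p_X(7)$ in the sums above.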


11.2 We have seen that $X_1 + X_2$ is a $\mathrm{Bin}(n_1 + n_2, p)$ distributed random
variable. Viewing $X_1 + X_2 + X_3$ as the sum of $X_1 + X_2$ and $X_3$, it follows
that $X_1 + X_2 + X_3$ is a $\mathrm{Bin}(n_1 + n_2 + n_3, p)$ distributed random variable.




11.3 The sum rule for two normal random variables tells us that X + Y is 
a normally distributed random variable. Its parameters are expectation and 
variance of X + Y. Hence by linearity of expectations 


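\[
\mu_{X+Y} = \mathrm{E}[X + Y] = \mathrm{E}[X] + \mathrm{E}[Y] = \mu_X + \mu_Y = 3 + 5 = 8,
\]

and by the rule for the variance of the sum

\[
\sigma^2_{X+Y} = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y) = \sigma_X^2 + \sigma_Y^2 = 16 + 9 = 25,
\]

using that $\mathrm{Cov}(X, Y) = 0$ due to independence of $X$ and $Y$.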


11.4 In the examples we have seen that the quotient X/Y of two independent 
standard normal random variables has a standard Cauchy distribution. Since 
Z = X/Y, the random variable 1/Z = Y/X. This is also the quotient of two 
independent standard normal random variables, and it has a standard Cauchy 
distribution. 


11.5 Exercises 


11.1 H Let X and Y be independent random variables with a discrete uniform 
distribution, i.e., with probability mass functions 


\[
p_X(k) = p_Y(k) = \frac{1}{N} \quad \text{for } k = 1, \dots, N.
\]


Use the addition rule for discrete random variables on page 152 to determine 
the probability mass function of Z = X + Y for the following two cases. 


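a. Suppose $N = 6$, so that $X$ and $Y$ represent two throws with a die. Show
that

\[
p_Z(k) = \mathrm{P}(X + Y = k) =
\begin{cases}
\dfrac{k - 1}{36} & \text{for } k = 2, \dots, 6, \\[2ex]
\dfrac{13 - k}{36} & \text{for } k = 7, \dots, 12.
\end{cases}
\]

You may check this with Quick exercise 11.1.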


b. Determine the expression for pz(k) for general N. 


11.2 H Consider a discrete random variable X taking values k = 0,1,2,... 
with probabilities 


\[
\mathrm{P}(X = k) = \frac{\mu^k}{k!}\,e^{-\mu},
\]

where $\mu > 0$. This is the Poisson distribution with parameter $\mu$. We will learn
more about this distribution in Chapter 12. This exercise illustrates that the
sum of independent Poisson variables again has a Poisson distribution.




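a. Let $X$ and $Y$ be independent random variables, each having a Poisson
distribution with $\mu = 1$. Show that for $k = 0, 1, 2, \dots$

\[
\mathrm{P}(X + Y = k) = \frac{2^k}{k!}\,e^{-2},
\]

by using $\sum_{\ell=0}^{k} \binom{k}{\ell} = 2^k$.

b. Let $X$ and $Y$ be independent random variables, each having a Poisson
distribution with parameters $\lambda$ and $\mu$. Show that for $k = 0, 1, 2, \dots$

\[
\mathrm{P}(X + Y = k) = \frac{(\lambda + \mu)^k}{k!}\,e^{-(\lambda + \mu)},
\]

by using $\sum_{\ell=0}^{k} \binom{k}{\ell} p^{\ell} (1 - p)^{k - \ell} = 1$ for $p = \mu/(\lambda + \mu)$.
We conclude that $X + Y$ has a Poisson distribution with parameter $\lambda + \mu$.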


11.3 Let X and Y be two independent random variables, where X has a 
Ber(p) distribution, and Y has a Ber(q) distribution. When p = q = r, we 
know that X + Y has a Bin(2,r) distribution. Suppose that p = 1/2 and 
q = 1/4. Determine P(X + Y =k), for k = 0,1,2, and conclude that X + Y 
does not have a binomial distribution. 


11.4 H Let X and Y be two independent random variables, where X has an 
N(2,5) distribution and Y has an N(5,9) distribution. Define $Z = 3X - 2Y + 1$.


a. Compute E[Z] and Var(Z). 
b. What is the distribution of Z? 
c. Compute P(Z < 6). 


11.5 F) Let X and Y be two independent, U(0,1) distributed random vari- 
ables. Use the rule on addition of independent continuous random variables 
on page 156 to show that the probability density function of X + Y is given 
by 


\[
f_Z(z) =
\begin{cases}
z & \text{for } 0 \le z < 1, \\
2 - z & \text{for } 1 \le z \le 2, \\
0 & \text{otherwise.}
\end{cases}
\]


11.6 FH Let X and Y be independent random variables with probability den- 
sities 


\[
f_X(x) = \tfrac{1}{4}\,x\,e^{-x/2} \quad \text{and} \quad f_Y(y) = \tfrac{1}{4}\,y\,e^{-y/2}.
\]


Use the rule on addition of independent continuous random variables to de- 
termine the probability density of Z= X +Y. 


11.7 (|) The two random variables in Exercise 11.6 are special cases of 
$\mathrm{Gam}(\alpha, \lambda)$ variables, namely with $\alpha = 2$ and $\lambda = 1/2$. More generally, let
$X_1, \dots, X_n$ be independent $\mathrm{Gam}(k, \lambda)$ distributed random variables, where
$\lambda > 0$ and $k$ is a positive integer. Argue, without doing any calculations,
that $X_1 + \cdots + X_n$ has a $\mathrm{Gam}(nk, \lambda)$ distribution.


11.8 We investigate the effect on the Cauchy distribution under a change of
units.

a. Let $X$ have a standard Cauchy distribution. What is the distribution of
$Y = rX + s$?

b. Let $X$ have a $\mathrm{Cau}(\alpha, \beta)$ distribution. What is the distribution of the
random variable $(X - \alpha)/\beta$?


11.9 H Let $X$ and $Y$ be independent random variables with a $\mathrm{Par}(\alpha)$ and
$\mathrm{Par}(\beta)$ distribution.


a. Take $\alpha = 3$ and $\beta = 1$ and determine the probability density of $Z = XY$.
b. Determine the probability density of $Z = XY$ for general $\alpha$ and $\beta$.


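11.10 Let $X$ and $Y$ be independent random variables with a $\mathrm{Par}(\alpha)$ and
$\mathrm{Par}(\beta)$ distribution.


a. Take $\alpha = \beta = 2$. Show that $Z = X/Y$ has probability density

\[
f_Z(z) =
\begin{cases}
z & \text{for } 0 < z < 1, \\
1/z^3 & \text{for } 1 \le z < \infty.
\end{cases}
\]

b. For general $\alpha, \beta > 0$, show that $Z = X/Y$ has probability density

\[
f_Z(z) =
\begin{cases}
\dfrac{\alpha\beta}{\alpha + \beta}\, z^{\beta - 1} & \text{for } 0 < z < 1, \\[2ex]
\dfrac{\alpha\beta}{\alpha + \beta}\, \dfrac{1}{z^{\alpha + 1}} & \text{for } 1 \le z < \infty.
\end{cases}
\]


11.11 Let $X_1$, $X_2$, and $X_3$ be three independent $\mathrm{Geo}(p)$ distributed random
variables, and let $Z = X_1 + X_2 + X_3$.


a. Show for $k \ge 3$ that the probability mass function $p_Z$ of $Z$ is given by

\[
p_Z(k) = \mathrm{P}(X_1 + X_2 + X_3 = k) = \tfrac{1}{2}(k - 2)(k - 1)\,p^3 (1 - p)^{k - 3}.
\]

b. Use the fact that $\sum_{k=3}^{\infty} p_Z(k) = 1$ to show that

\[
p^2 \left(\mathrm{E}[X_1^2] + \mathrm{E}[X_1]\right) = 2.
\]

c. Use $\mathrm{E}[X_1] = 1/p$ and part b to conclude that

\[
\mathrm{E}[X_1^2] = \frac{2 - p}{p^2} \quad \text{and} \quad \mathrm{Var}(X_1) = \frac{1 - p}{p^2}.
\]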


11.12 Show that $\Gamma(1) = 1$, and use integration by parts to show that

\[
\Gamma(x + 1) = x\,\Gamma(x) \quad \text{for } x > 0.
\]

Use this last expression to show for $n = 1, 2, \dots$ that

\[
\Gamma(n) = (n - 1)!
\]

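11.13 Let $Z_n$ have an Erlang-$n$ distribution with parameter $\lambda$.

a. Use integration by parts to show that for $a \ge 0$ and $n \ge 2$:

\[
\mathrm{P}(Z_n \le a) = \int_0^a \frac{\lambda^n z^{n-1} e^{-\lambda z}}{(n - 1)!}\,dz
 = -\frac{(\lambda a)^{n-1}}{(n - 1)!}\,e^{-\lambda a} + \mathrm{P}(Z_{n-1} \le a).
\]

b. Use a to show that for $a \ge 0$:

\[
\mathrm{P}(Z_n \le a) = -\sum_{i=1}^{n-1} \frac{(\lambda a)^i}{i!}\,e^{-\lambda a} + \mathrm{P}(Z_1 \le a).
\]

c. Conclude that for $a \ge 0$:

\[
\mathrm{P}(Z_n \le a) = 1 - e^{-\lambda a} \sum_{i=0}^{n-1} \frac{(\lambda a)^i}{i!}.
\]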


12 


The Poisson process 


In many random phenomena we encounter, it is not just one or two random 
variables that play a role but a whole collection. In that case one often speaks 
of a random process. The Poisson process is a simple kind of random process, 
which models the occurrence of random points in time or space. There are 
numerous ways in which processes of random points arise: some examples are 
presented in the first section. The Poisson process describes in a certain sense 
the most random way to distribute points in time or space. This is made more 
precise with the notions of homogeneity and independence. 


12.1 Random points 


Typical examples of the occurrence of random time points are: arrival times 
of email messages at a server, the times at which asteroids hit the earth, 
arrival times of radioactive particles at a Geiger counter, times at which your 
computer crashes, the times at which electronic components fail, and arrival 
times of people at a pump in an oasis. 


Examples of the occurrence of random points in space are: the locations of 
asteroid impacts with earth (2-dimensional), the locations of imperfections in a 
material (3-dimensional), and the locations of trees in a forest (2-dimensional). 


Some of these phenomena are better modeled by the Poisson process than 
others. Loosely speaking, one might say that the Poisson process model often 
applies in situations where there is a very large population, and each member 
of the population has a very small probability to produce a point of the 
process. This is, for instance, well fulfilled in the Geiger counter example 
where, in a huge collection of atoms, just a few will emit a radioactive particle 
(see [28]). A property of the Poisson process—as we will see shortly—is that 
points may lie arbitrarily close together. Therefore the tree locations are not 
so well modeled by the Poisson process. 


12.2 Taking a closer look at random arrivals 


A well-known example that is usually modeled by the Poisson process is that 
of calls arriving at a telephone exchange—the exchange is connected to a large 
number of people who make phone calls now and then. This will be our leading 
example in this section. 

Telephone calls arrive at random times $X_1, X_2, \dots$ at the telephone exchange
during a time interval $[0, t]$.

[Figure: time axis from 0 to $t$ with the arrival times $X_1, \dots, X_5$ marked.]


The two basic assumptions we make on these random arrivals are 


1. (Homogeneity) The rate $\lambda$ at which arrivals occur is constant over time:
in a subinterval of length $u$ the expectation of the number of telephone
calls is $\lambda u$.

2. (Independence) The numbers of arrivals in disjoint time intervals are
independent random variables.


Homogeneity is also called weak stationarity. We denote the total number of
calls in an interval $I$ by $N(I)$, abbreviating $N([0, t])$ to $N_t$. Homogeneity then
implies that we require

\[
\mathrm{E}[N_t] = \lambda t.
\]

To get hold of the distribution of $N_t$ we divide the interval $[0, t]$ into $n$ intervals
of length $t/n$. When $n$ is large enough, every interval $I_{j,n} = ((j - 1)\,t/n,\; j\,t/n]$
will contain either 0 or 1 arrival:

[Figure: time axis divided into $n$ subintervals, with the arrival times $X_1, \dots, X_5$ marked.]

For such a large $n$ (which also satisfies $n > \lambda t$), let $R_j$ be the number of
arrivals in the time interval $I_{j,n}$. Since $R_j$ is 0 or 1, $R_j$ has a $\mathrm{Ber}(p_j)$
distribution for some $p_j$. Recall that for a Bernoulli random variable
$\mathrm{E}[R_j] = 0 \cdot (1 - p_j) + 1 \cdot p_j = p_j$. By the homogeneity
assumption, for each $j$

\[
p_j = \lambda \cdot \text{length of } I_{j,n} = \frac{\lambda t}{n}.
\]
Summing the number of calls in the intervals gives the total number of calls,
hence

\[
N_t = R_1 + R_2 + \cdots + R_n.
\]




By the independence assumption, the $R_j$ are independent random variables,
therefore $N_t$ has a $\mathrm{Bin}(n, p)$ distribution, with $p = \lambda t/n$.


Remark 12.1 (About this approximation). The argument just given
seems pretty convincing, but actually $R_j$ does not have a Bernoulli
distribution, whatever the value of $n$. A way to see this is the following. Every
interval $I_{j,n}$ is a union of the two intervals $I_{2j-1,2n}$ and $I_{2j,2n}$. Hence the
probability that $I_{j,n}$ contains two calls is at least $(\lambda t/2n)^2 = \lambda^2 t^2/4n^2$,
which is larger than zero.

Note however, that the probability of having two arrivals is of smaller order
than the probability that $R_j$ takes the value 1. If we add a third assumption,
namely that the probability of two or more calls arriving in an interval $I_{j,n}$
tends to zero faster than $1/n$, then the conclusion below on the distribution
of $N_t$ is valid.


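We have found that (at least in first approximation)

\[
\mathrm{P}(N_t = k) = \binom{n}{k} \left(\frac{\lambda t}{n}\right)^{k} \left(1 - \frac{\lambda t}{n}\right)^{n-k} \quad \text{for } k = 0, \dots, n.
\]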


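In this analysis $n$ is a rather artificial parameter, of which we only know that
it should not be “too small.” It therefore seems a good idea to get rid of $n$
by letting $n$ go to infinity, hoping that the probability distribution of $N_t$ will
settle down. Note that

\[
\lim_{n \to \infty} \binom{n}{k} \frac{1}{n^k}
 = \lim_{n \to \infty} \frac{n}{n} \cdot \frac{n - 1}{n} \cdots \frac{n - k + 1}{n} \cdot \frac{1}{k!}
 = \frac{1}{k!},
\]

and that

\[
\lim_{n \to \infty} \left(1 - \frac{\lambda t}{n}\right)^{n} = e^{-\lambda t}
\quad \text{and} \quad
\lim_{n \to \infty} \left(1 - \frac{\lambda t}{n}\right)^{-k} = 1.
\]

Combining these limits gives

\[
\lim_{n \to \infty} \mathrm{P}(N_t = k) = \frac{(\lambda t)^k}{k!}\,e^{-\lambda t}.
\]

Since

\[
e^{-\lambda t} \sum_{k=0}^{\infty} \frac{(\lambda t)^k}{k!} = e^{-\lambda t}\,e^{\lambda t} = 1,
\]

we have indeed run into a probability distribution on the numbers $0, 1, 2, \dots$.
Note that all these probabilities are determined by the single value $\lambda t$. This
motivates the following definition.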
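To see how quickly the binomial probabilities settle down as $n$ grows, one can compare the $\mathrm{Bin}(n, \lambda t/n)$ probabilities with the limiting values $\frac{(\lambda t)^k}{k!} e^{-\lambda t}$ numerically. The sketch below is only an illustration; the value $\lambda t = 2$ and the range of $k$ are arbitrary assumptions.

```python
import math


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(N = k) for N with a Bin(n, p) distribution."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def limit_pmf(k: int, mu: float) -> float:
    """The limiting probability mu^k / k! * exp(-mu)."""
    return mu**k / math.factorial(k) * math.exp(-mu)


def max_gap(n: int, mu: float, k_max: int = 10) -> float:
    """Largest difference between the two probabilities for k = 0, ..., k_max."""
    return max(abs(binom_pmf(k, n, mu / n) - limit_pmf(k, mu)) for k in range(k_max + 1))


lam_t = 2.0  # hypothetical value of lambda * t
for n in (10, 100, 10_000):
    print(n, max_gap(n, lam_t))
```

The gap shrinks roughly in proportion to $1/n$, which supports the idea of replacing the artificial parameter $n$ by the limit.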




DEFINITION. A discrete random variable $X$ has a Poisson distribution
with parameter $\mu$, where $\mu > 0$, if its probability mass function $p$
is given by

\[
p(k) = \mathrm{P}(X = k) = \frac{\mu^k}{k!}\,e^{-\mu} \quad \text{for } k = 0, 1, 2, \dots.
\]

We denote this distribution by $\mathrm{Pois}(\mu)$.

Figure 12.1 displays the graphs of the probability mass functions of the Poisson
distribution with $\mu = 0.9$ (left) and the Poisson distribution with $\mu = 5$
(right).


Fig. 12.1. The probability mass functions of the Pois(0.9) and the Pois(5)
distributions.


QUICK EXERCISE 12.1 Consider the event “exactly one call arrives in the
interval $[0, 2s]$.” The probability of this event is $\mathrm{P}(N_{2s} = 1) = \lambda \cdot 2s \cdot e^{-\lambda \cdot 2s}$.
But note that this event is the same as “there is exactly one call in the interval
$[0, s)$ and no calls in the interval $[s, 2s]$, or no calls in $[0, s)$ and exactly one call
in $[s, 2s]$.” Verify (using assumptions 1 and 2) that you get the same answer
if you compute the probability of the event in this way.


We do have a hint¹ about what the expectation and variance of a Poisson
random variable might be: since $\mathrm{E}[N_t] = \lambda t$ for all $n$, we anticipate that the
limiting Poisson distribution will have expectation $\lambda t$. Similarly, since $N_t$ has
a $\mathrm{Bin}(n, \lambda t/n)$ distribution, we anticipate that the variance will be

\[
\lim_{n \to \infty} \mathrm{Var}(N_t) = \lim_{n \to \infty} n \cdot \frac{\lambda t}{n} \left(1 - \frac{\lambda t}{n}\right) = \lambda t.
\]

¹ This is really not more than a hint: there are simple examples where the
distributions of random variables converge to a distribution whose expectation is
different from the limit of the expectations of the distributions! (cf. Exercise 12.14).


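Actually, the expectation of a Poisson random variable $X$ with parameter $\mu$
is easy to compute:

\[
\begin{aligned}
\mathrm{E}[X] &= \sum_{k=0}^{\infty} k\,\frac{\mu^k}{k!}\,e^{-\mu}
             = e^{-\mu} \sum_{k=1}^{\infty} \frac{\mu^k}{(k - 1)!} \\
            &= \mu\,e^{-\mu} \sum_{k=1}^{\infty} \frac{\mu^{k-1}}{(k - 1)!}
             = \mu\,e^{-\mu} \sum_{j=0}^{\infty} \frac{\mu^j}{j!} = \mu.
\end{aligned}
\]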


In a similar way the variance can be determined (see Exercise 12.8), and we 
arrive at the following rule. 


THE EXPECTATION AND VARIANCE OF A POISSON DISTRIBUTION.
Let $X$ have a Poisson distribution with parameter $\mu$; then

\[
\mathrm{E}[X] = \mu \quad \text{and} \quad \mathrm{Var}(X) = \mu.
\]
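A quick numerical check of this rule sums the series for the first two moments directly. This is a sketch only: the truncation at 100 terms is an assumption that is harmless for moderate $\mu$, since the neglected tail is then negligible.

```python
import math


def poisson_pmf(k: int, mu: float) -> float:
    """Probability mass function of the Pois(mu) distribution."""
    return math.exp(-mu) * mu**k / math.factorial(k)


def poisson_moments(mu: float, terms: int = 100) -> tuple:
    """Approximate (E[X], Var(X)) by truncating the defining sums after `terms` terms."""
    mean = sum(k * poisson_pmf(k, mu) for k in range(terms))
    second_moment = sum(k * k * poisson_pmf(k, mu) for k in range(terms))
    return mean, second_moment - mean**2


for mu in (0.9, 5.0):
    print(mu, poisson_moments(mu))  # both entries should be close to mu
```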


12.3 The one-dimensional Poisson process 


We will derive some properties of the sequence of random points $X_1, X_2, \dots$
that we considered in the previous section. What we derived so far is that for
any interval $(s, s + t]$ the number $N((s, s + t])$ of points $X_i$ in that interval is
a random variable with a $\mathrm{Pois}(\lambda t)$ distribution.


Interarrival times 


The differences

\[
T_i = X_i - X_{i-1}
\]

are called interarrival times. Here we define $T_1 = X_1$, the time of the first
arrival. To determine the probability distribution of $T_1$, we observe that the
event $\{T_1 > t\}$ that the first call arrives after time $t$ is the same as the event
$\{N_t = 0\}$ that no calls have been made in $[0, t]$. But this implies that

\[
\mathrm{P}(T_1 \le t) = 1 - \mathrm{P}(T_1 > t) = 1 - \mathrm{P}(N_t = 0) = 1 - e^{-\lambda t}.
\]

Therefore $T_1$ has an exponential distribution with parameter $\lambda$.


To compute the joint distribution of $T_1$ and $T_2$, we consider the conditional
probability that $T_2 > t$, given that $T_1 = s$, and use the property that arrivals
in different intervals are independent:

\[
\begin{aligned}
\mathrm{P}(T_2 > t \mid T_1 = s) &= \mathrm{P}(\text{no arrivals in } (s, s + t] \mid T_1 = s) \\
 &= \mathrm{P}(\text{no arrivals in } (s, s + t]) \\
 &= \mathrm{P}(N((s, s + t]) = 0) = e^{-\lambda t}.
\end{aligned}
\]

Since this answer does not depend on $s$, we conclude that $T_1$ and $T_2$ are
independent, and

\[
\mathrm{P}(T_2 > t) = e^{-\lambda t},
\]

i.e., $T_2$ also has an exponential distribution with parameter $\lambda$. Actually,
although the conclusion is correct, the method to derive it is not, because we
conditioned on the event $\{T_1 = s\}$, which has zero probability. This problem
could be circumvented by conditioning on the event that $T_1$ lies in some small
interval, but that will not be done here. Analogously, one can show that the $T_i$
are independent and have an $\mathrm{Exp}(\lambda)$ distribution. This nice property allows
us to give a simple definition of the one-dimensional Poisson process.


DEFINITION. The one-dimensional Poisson process with intensity $\lambda$
is a sequence $X_1, X_2, X_3, \dots$ of random variables having the property
that the interarrival times $X_1, X_2 - X_1, X_3 - X_2, \dots$ are independent
random variables, each with an $\mathrm{Exp}(\lambda)$ distribution.


Note that the connection with $N_t$ is as follows: $N_t$ is equal to the number of
$X_i$ that are smaller than (or equal to) $t$.
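The definition translates directly into a simulation: keep adding independent exponential gaps until the time horizon is passed. The sketch below does this and checks that the resulting counts $N_t$ have mean and variance close to $\lambda t$; the intensity, horizon, number of repetitions, and seed are hypothetical values chosen for illustration.

```python
import random


def poisson_process(lam: float, t_end: float, rng: random.Random) -> list:
    """Arrival times in [0, t_end], built from independent Exp(lam) interarrival times."""
    points = []
    x = rng.expovariate(lam)  # X_1 = T_1
    while x <= t_end:
        points.append(x)
        x += rng.expovariate(lam)  # X_i = X_{i-1} + T_i
    return points


rng = random.Random(3)
lam, t_end = 2.0, 5.0  # hypothetical intensity and time horizon
counts = [len(poisson_process(lam, t_end, rng)) for _ in range(20_000)]
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
print(mean, var)  # both should be near lam * t_end = 10
```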


QUICK EXERCISE 12.2 We model the arrivals of email messages at a server as 
a Poisson process. Suppose that on average 330 messages arrive per minute. 
What would you choose for the intensity $\lambda$ in messages per second? What is
the expectation of the interarrival time? 


An obvious question is: what is the distribution of $X_i$? This has already been
answered in Chapter 11: since $X_i$ is a sum of $i$ independent exponentially
distributed random variables, we have the following.


THE POINTS OF THE POISSON PROCESS. For $i = 1, 2, \dots$ the random
variable $X_i$ has a $\mathrm{Gam}(i, \lambda)$ distribution.


The distribution of points 


Another interesting question is: if we know that n points are generated in an 
interval, where do these points lie? Since the distribution of the number of 
points only depends on the length of the interval, and not on its location, it 
suffices to determine this for an interval starting at 0. Let this interval be $[0, a]$.
We start with the simplest case, where there is one point in $[0, a]$: suppose
that $N([0, a]) = 1$. Then, for $0 < s < a$:

\[
\begin{aligned}
\mathrm{P}(X_1 \le s \mid N([0, a]) = 1)
 &= \frac{\mathrm{P}(X_1 \le s,\; N([0, a]) = 1)}{\mathrm{P}(N([0, a]) = 1)} \\
 &= \frac{\mathrm{P}(N([0, s]) = 1,\; N((s, a]) = 0)}{\mathrm{P}(N([0, a]) = 1)} \\
 &= \frac{\lambda s\,e^{-\lambda s}\,e^{-\lambda(a - s)}}{\lambda a\,e^{-\lambda a}}
  = \frac{s}{a}.
\end{aligned}
\]

We find that conditional on the event $\{N([0, a]) = 1\}$, the random variable
$X_1$ is uniformly distributed over the interval $[0, a]$.


Now suppose that it is given that there are two points in $[0, a]$: $N([0, a]) = 2$.
In a way similar to what we did for one point, we can show that (see
Exercise 12.12)

\[
\mathrm{P}(X_1 \le s,\; X_2 \le t \mid N([0, a]) = 2) = \frac{t^2 - (t - s)^2}{a^2}.
\]

Now recall the result of Exercise 9.17: if $U_1$ and $U_2$ are two independent
random variables, both uniformly distributed over $[0, a]$, then the joint
distribution function of $V = \min(U_1, U_2)$ and $Z = \max(U_1, U_2)$ is given by

\[
\mathrm{P}(V \le s,\; Z \le t) = \frac{t^2 - (t - s)^2}{a^2} \quad \text{for } 0 \le s \le t \le a.
\]

Thus we have found that, if we forget about their order, the two points in
$[0, a]$ are independent and uniformly distributed over $[0, a]$. With somewhat
more work, this generalizes to an arbitrary number of points, and we arrive
at the following property.


LOCATION OF THE POINTS, GIVEN THEIR NUMBER. Given that
the Poisson process has $n$ points in the interval $[a, b]$, the locations
of these points are independently distributed, each with a uniform
distribution on $[a, b]$.
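The one-point case can be illustrated by simulation: generate the first two arrival times from exponential gaps, keep only the runs with exactly one arrival in $[0, a]$, and look at where that arrival lies. This is a sketch under assumed values ($\lambda = 1$, $a = 2$, and a fixed seed), not part of the derivation above.

```python
import random


def first_point_given_one(lam: float, a: float, trials: int, rng: random.Random) -> list:
    """Values of X_1 from simulated runs that have exactly one point in [0, a]."""
    kept = []
    for _ in range(trials):
        x1 = rng.expovariate(lam)
        x2 = x1 + rng.expovariate(lam)
        if x1 <= a < x2:  # exactly one arrival falls in [0, a]
            kept.append(x1)
    return kept


rng = random.Random(6)
lam, a = 1.0, 2.0  # hypothetical intensity and interval length
kept = first_point_given_one(lam, a, 100_000, rng)
for s in (0.5, 1.0, 1.5):
    fraction = sum(x <= s for x in kept) / len(kept)
    print(s, round(fraction, 3), s / a)  # empirical value next to s/a
```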


12.4 Higher-dimensional Poisson processes 


Our definition of the one-dimensional Poisson process, starting with the in- 
terarrival times, does not generalize easily, because it is based on the ordering 
of the real numbers. However, we can easily extend the assumptions of inde- 
pendence, homogeneity, and the Poisson distribution property. To do this we 
need a higher-dimensional version of the concept of length. We denote the k- 
dimensional volume of a set A in k-dimensional space by m(A). For instance, 
in the plane m(A) is the area of A, and in space m(A) is the volume of A. 




DEFINITION. The $k$-dimensional Poisson process with intensity $\lambda$
is a collection $X_1, X_2, X_3, \dots$ of random points having the property
that if $N(A)$ denotes the number of points in the set $A$, then

1. (Homogeneity) The random variable $N(A)$ has a Poisson
distribution with parameter $\lambda\,m(A)$.

2. (Independence) For disjoint sets $A_1, A_2, \dots, A_n$ the random
variables $N(A_1), N(A_2), \dots, N(A_n)$ are independent.


QUICK EXERCISE 12.3 Suppose that the locations of defects in a certain type of 
material follow the two-dimensional Poisson process model. For this material 
it is known that it contains on average five defects per square meter. What is 
the probability that a strip of length 2 meters and width 5 cm will be without 
defects? 


In Figure 7.4 the locations of the buildings the architect wanted to distribute 
over a 100-by-300-m terrain have been generated by a two-dimensional Poisson 
process. This has been done in the following way. One can again show that 
given the total number of points in a set, these points are uniformly distributed 
over the set. This leads to the following procedure: first one generates a value 
n from a Poisson distribution with the appropriate parameter ($\lambda$ times the
area), then one generates n times a point uniformly distributed over the 100- 
by-300 rectangle. 
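The two-step procedure just described can be sketched as follows. Python's standard library has no Poisson sampler, so this example draws the Poisson count by counting unit-rate exponential arrivals in $[0, \mu]$, which is an assumption of the sketch rather than the method used for Figure 7.4; the intensity value is likewise hypothetical.

```python
import random


def poisson_sample(mu: float, rng: random.Random) -> int:
    """Draw from Pois(mu) by counting Exp(1) arrivals in the interval [0, mu]."""
    n = 0
    x = rng.expovariate(1.0)
    while x <= mu:
        n += 1
        x += rng.expovariate(1.0)
    return n


def poisson_points_rectangle(lam: float, width: float, height: float,
                             rng: random.Random) -> list:
    """Points of a two-dimensional Poisson process on [0, width] x [0, height]."""
    n = poisson_sample(lam * width * height, rng)  # number of points: Pois(lam * area)
    return [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]


rng = random.Random(4)
lam = 0.001  # hypothetical intensity, in points per square meter
points = poisson_points_rectangle(lam, 100.0, 300.0, rng)
print(len(points), points[:3])
```

With this intensity the expected number of points on the 100-by-300 terrain is $0.001 \cdot 30000 = 30$.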

Actually one can generate a higher-dimensional Poisson process in a way that 
is very similar to the natural way this can be done for the one-dimensional 
process. Directly from the definition of the one-dimensional process we see 
that it can be obtained by consecutively generating points with exponentially 
distributed gaps. We will explain a similar procedure for dimension two. For
$s > 0$, let

\[
M_s = N(C_s),
\]

where $C_s$ is the circular region of radius $s$, centered at the origin. Since $C_s$
has area $\pi s^2$, $M_s$ has a Poisson distribution with parameter $\lambda \pi s^2$. Let $R_i$
denote the distance of the $i$th closest point to the origin. This is illustrated
in Figure 12.2.

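Note that $R_i$ is the analogue of the $i$th arrival time for the one-dimensional
Poisson process: we have in fact that

\[
R_i \le s \quad \text{if and only if} \quad M_s \ge i.
\]

In particular, with $i = 1$ and $s = \sqrt{t}$,

\[
\mathrm{P}(R_1^2 \le t) = \mathrm{P}(R_1 \le \sqrt{t}) = \mathrm{P}(M_{\sqrt{t}} > 0) = 1 - e^{-\lambda \pi t}.
\]

In other words: $R_1^2$ is $\mathrm{Exp}(\lambda \pi)$ distributed. For general $i$, we can similarly
write

\[
\mathrm{P}(R_i^2 \le t) = \mathrm{P}(R_i \le \sqrt{t}) = \mathrm{P}(M_{\sqrt{t}} \ge i).
\]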




Fig. 12.2. The Poisson process in the plane, with the two circles of the two points 
closest to the origin. 


So

\[
\mathrm{P}(R_i^2 \le t) = 1 - e^{-\lambda \pi t} \sum_{j=0}^{i-1} \frac{(\lambda \pi t)^j}{j!},
\]

which means that $R_i^2$ has a $\mathrm{Gam}(i, \lambda \pi)$ distribution, as we saw on page 157.
Since gamma distributions arise as sums of independent exponential
distributions, we can also write

\[
R_i^2 = R_{i-1}^2 + T_i,
\]

where the $T_i$ are independent $\mathrm{Exp}(\lambda \pi)$ random variables (and where $R_0 = 0$).
Note that this is quite similar to the one-dimensional case. To simulate the
two-dimensional Poisson process from a sequence $U_1, U_2, \dots$ of independent
$U(0,1)$ random variables, one can therefore proceed as follows (recall from
Section 6.2 that $-(1/\lambda)\ln(U_i)$ has an $\mathrm{Exp}(\lambda)$ distribution): for $i = 1, 2, \dots$
put

\[
R_i = \sqrt{R_{i-1}^2 - \frac{1}{\lambda \pi} \ln(U_{2i})};
\]

this gives the distance of the $i$th point to the origin, and then put the point
on this circle according to an angle value generated by $2\pi U_{2i-1}$. This is the
correct way to do it, because one can show that in polar coordinates the radius
and the angle of a Poisson process point are independent of each other, and
the angle is uniformly distributed over $[0, 2\pi]$. The latter is called the isotropy
property of the Poisson process.
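The recipe above can be written out in a few lines. The sketch below follows the recursion $R_i^2 = R_{i-1}^2 - \frac{1}{\lambda\pi}\ln(U_{2i})$ with angle $2\pi U_{2i-1}$; the function name and the use of `1 - rng.random()` (to keep the argument of the logarithm strictly positive) are implementation choices of this example.

```python
import math
import random


def poisson_points_polar(lam: float, count: int, rng: random.Random) -> list:
    """First `count` points of a planar Poisson process, ordered by distance to the origin."""
    points = []
    r_squared = 0.0  # R_0 = 0
    for _ in range(count):
        angle = 2 * math.pi * rng.random()          # angle 2*pi*U_{2i-1}
        u = 1.0 - rng.random()                      # U_{2i}, in (0, 1] so that log(u) is defined
        r_squared -= math.log(u) / (lam * math.pi)  # R_i^2 = R_{i-1}^2 - ln(U_{2i}) / (lam*pi)
        r = math.sqrt(r_squared)
        points.append((r * math.cos(angle), r * math.sin(angle)))
    return points


rng = random.Random(7)
for x, y in poisson_points_polar(lam=1.0, count=5, rng=rng):
    print(round(x, 3), round(y, 3), round(math.hypot(x, y), 3))
```

Unlike the rectangle procedure, this construction produces the points in order of increasing distance to the origin and can be stopped after any number of points.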




12.5 Solutions to the quick exercises 


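12.1 The probability of exactly one call in $[0, s)$ and no calls in $[s, 2s]$ equals

\[
\begin{aligned}
\mathrm{P}(N([0, s)) = 1,\; N([s, 2s]) = 0) &= \mathrm{P}(N([0, s)) = 1)\,\mathrm{P}(N([s, 2s]) = 0) \\
 &= \mathrm{P}(N([0, s)) = 1)\,\mathrm{P}(N([0, s]) = 0) \\
 &= \lambda s\,e^{-\lambda s} \cdot e^{-\lambda s},
\end{aligned}
\]

because of independence and homogeneity. In the same way, the probability
of exactly one call in $[s, 2s]$ and no calls in $[0, s)$ is equal to $e^{-\lambda s} \cdot \lambda s\,e^{-\lambda s}$. And
indeed: $\lambda s\,e^{-\lambda s} \cdot e^{-\lambda s} + e^{-\lambda s} \cdot \lambda s\,e^{-\lambda s} = 2\lambda s\,e^{-\lambda \cdot 2s}$.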


12.2 Because there are 60 seconds in a minute, we have $60\lambda = 330$. It follows
that $\lambda = 5.5$. Since the interarrival times have an $\mathrm{Exp}(\lambda)$ distribution, the
expected time between messages is $1/\lambda \approx 0.18$ second.


12.3 The intensity of this process is $\lambda = 5$ per m². The area of the strip is
$2 \cdot (1/20) = 1/10$ m². Hence the probability that no defects occur in the strip
is $e^{-\lambda \cdot (\text{area of strip})} = e^{-5 \cdot (1/10)} = e^{-1/2} \approx 0.61$.


12.6 Exercises 


12.1 H In each of the following examples, try to indicate whether the Poisson 
process would be a good model. 


a. The times of bankruptcy of enterprises in the United States.
b. The times a chicken lays its eggs.
c. The times of airplane crashes in a worldwide registration.
d. The locations of wrongly spelled words in a book.
e. The times of traffic accidents at a crossroad.


12.2 The number of customers that visit a bank on a day is modeled by a 
Poisson distribution. It is known that the probability of no customers at all 
is 0.00001. What is the expected number of customers? 


12.3 Let N have a Pois(4) distribution. What is P(N = 4)? 


12.4 Let X have a Pois(2) distribution. What is P(X <1)? 


12.5 EJ The number of errors on a hard disk is modeled as a Poisson random 
variable with expectation one error in every Mb, that is, in every $2^{20}$ bytes.


a. What is the probability of at least one error in a sector of 512 bytes? 


b. The hard disk is an 18.62-Gb disk drive with 39054015 sectors. What is 
the probability of at least one error on the hard disk? 




12.6 & A certain brand of copper wire has flaws about every 40 centimeters. 
Model the locations of the flaws as a Poisson process. What is the probability 
of two flaws in 1 meter of wire? 


12.7 H The Poisson model is sometimes used to study the flow of traffic ([15]).
If the traffic can flow freely, it behaves like a Poisson process. A 20-minute
time interval is divided into 10-second time slots. At a certain point along the
highway the number of passing cars is registered for each 10-second time slot.
Let $n_j$ be the number of slots in which $j$ cars have passed for $j = 0, \dots, 9$.
Suppose that one finds

  j      0   1   2   3   4   5   6   7   8   9
  n_j   19  38  28  20   7   3   4   0   0   1

Note that the total number of cars passing in these 20 minutes is 230.


a. What would you choose for the intensity parameter $\lambda$?

b. Suppose one estimates the probability of 0 cars passing in a 10-second
time slot by $n_0$ divided by the total number of time slots. Does that
(reasonably) agree with the value that follows from your answer in a?

c. What would you take for the probability that 10 cars pass in a 10-second 
time slot? 


12.8 EF] Let X be a Poisson random variable with parameter $\mu$.


a. Compute E[X(X — 1)]. 
b. Compute Var(X), using that Var(X) = E[X(X — 1)] + E[X] - (E[X])?. 


12.9 Let $Y_1$ and $Y_2$ be independent Poisson random variables with parameters
$\mu_1$ and $\mu_2$, respectively. Show that $Y = Y_1 + Y_2$ also has a Poisson distribution.
Instead of using the addition rule in Section 11.1 as in Exercise 11.2, you
can prove this without doing any computations by considering the number
of points of a Poisson process (with intensity 1) in two disjoint intervals of
length $\mu_1$ and $\mu_2$.


12.10 Let $X$ be a random variable with a $\mathrm{Pois}(\mu)$ distribution. Show the
following. If $\mu < 1$, then the probabilities $\mathrm{P}(X = k)$ are strictly decreasing
in $k$. If $\mu > 1$, then the probabilities $\mathrm{P}(X = k)$ are first increasing, then
decreasing (cf. Figure 12.1). What happens if $\mu = 1$?


12.11 H Consider the one-dimensional Poisson process with intensity $\lambda$. Show
that the number of points in $[0, t]$, given that the number of points in $[0, 2t]$
is equal to $n$, has a $\mathrm{Bin}(n, \frac{1}{2})$ distribution.

Hint: write the event $\{N([0, s]) = k,\; N([0, 2s]) = n\}$ as the intersection of the
(independent!) events $\{N([0, s]) = k\}$ and $\{N((s, 2s]) = n - k\}$.




12.12 We consider the one-dimensional Poisson process. Suppose for some
$a > 0$ it is given that there are exactly two points in $[0, a]$, or in other words:
$N_a = 2$. The goal of this exercise is to determine the joint distribution of $X_1$
and $X_2$, the locations of the two points, conditional on $N_a = 2$.


a. Prove that for $0 < s < t < a$

\[
\mathrm{P}(X_1 \le s,\; X_2 \le t,\; N_a = 2)
 = \mathrm{P}(X_2 \le t,\; N_a = 2) - \mathrm{P}(X_1 > s,\; X_2 \le t,\; N_a = 2).
\]

b. Deduce from a that

\[
\mathrm{P}(X_1 \le s,\; X_2 \le t,\; N_a = 2) = e^{-\lambda a} \left(\frac{\lambda^2 t^2}{2!} - \frac{\lambda^2 (t - s)^2}{2!}\right).
\]

c. Deduce from b that for $0 < s < t < a$

\[
\mathrm{P}(X_1 \le s,\; X_2 \le t \mid N_a = 2) = \frac{t^2 - (t - s)^2}{a^2}.
\]

12.13 Walking through a meadow we encounter two kinds of flowers, daisies
and dandelions. As we walk in a straight line, we model the positions of the
flowers we encounter with a one-dimensional Poisson process with intensity $\lambda$.
It appears that about one in every four flowers is a daisy. Forgetting about
the dandelions, what does the process of the daisies look like? This question
will be answered with the following steps.


a. Let $N_t$ be the total number of flowers, $X_t$ the number of daisies, and $Y_t$
be the number of dandelions we encounter during the first $t$ minutes of
our walk. Note that $X_t + Y_t = N_t$. Suppose that each flower is a daisy
with probability 1/4, independent of the other flowers. Argue that

\[
\mathrm{P}(X_t = n,\; Y_t = m \mid N_t = n + m) = \binom{n + m}{n} \left(\frac{1}{4}\right)^{n} \left(\frac{3}{4}\right)^{m}.
\]

b. Show that

\[
\mathrm{P}(X_t = n,\; Y_t = m) = \frac{1}{n!}\,\frac{1}{m!} \left(\frac{1}{4}\right)^{n} \left(\frac{3}{4}\right)^{m} e^{-\lambda t} (\lambda t)^{n + m},
\]

by conditioning on $N_t$ and using a.

c. By writing $e^{-\lambda t} = e^{-(\lambda/4)t}\,e^{-(3\lambda/4)t}$ and summing over $m$, show that

\[
\mathrm{P}(X_t = n) = \frac{1}{n!}\,e^{-(\lambda/4)t} \left(\frac{\lambda t}{4}\right)^{n}.
\]

Since it is clear that the numbers of daisies that we encounter in disjoint time
intervals are independent, we may conclude from c that the process $(X_t)$ is
again a Poisson process, with intensity $\lambda/4$. One often says that the process
$(X_t)$ is obtained by thinning the process $(N_t)$. In our example this corresponds
to picking all the dandelions.




12.14 LH In this exercise we look at a simple example of random variables $X_n$
that have the property that their distributions converge to the distribution of
a random variable $X$ as $n \to \infty$, while it is not true that their expectations
converge to the expectation of $X$. Let for $n = 1, 2, \dots$ the random variables
$X_n$ be defined by

\[
\mathrm{P}(X_n = 0) = 1 - \frac{1}{n} \quad \text{and} \quad \mathrm{P}(X_n = 7n) = \frac{1}{n}.
\]

a. Let $X$ be the random variable that is equal to 0 with probability 1. Show
that for all $a$ the probability mass functions $p_{X_n}(a)$ of the $X_n$ converge to
the probability mass function $p_X(a)$ of $X$ as $n \to \infty$. Note that $\mathrm{E}[X] = 0$.


b. Show that nonetheless $\mathrm{E}[X_n] = 7$ for all $n$.


13 


The law of large numbers 


For many experiments and observations concerning natural phenomena—such 
as measuring the speed of light—one finds that performing the procedure twice 
under (what seem) identical conditions results in two different outcomes. Un- 
controllable factors cause “random” variation. In practice one tries to over- 
come this as follows: the experiment is repeated a number of times and the 
results are averaged in some way. In this chapter we will see why this works so 
well, using a model for repeated measurements. We view them as a sequence 
of independent random variables, each with the same unknown distribution. 
It is a probabilistic fact that from such a sequence—in principle—any feature 
of the distribution can be recovered. This is a consequence of the law of large 
numbers. 


13.1 Averages vary less 


Scientists and engineers involved in experimental work have known for cen- 
turies that more accurate answers are obtained when measurements or ex- 
periments are repeated a number of times and one averages the individual 
outcomes.¹ For example, if you read a description of A.A. Michelson’s work
done in 1879 to determine the speed of light, you would find that for each 
value he collected, repeated measurements at several levels were performed. 
In an article in Statistical Science describing his work ([18]), R.J. MacKay
and R.W. Oldford state: “It is clear that Michelson appreciated the power 
of averaging to reduce variability in measurement.” We shall see that we can 
understand this reduction using only what we have learned so far about prob- 
ability in combination with a simple inequality called Chebyshev’s inequality. 


Throughout this chapter we consider a sequence of random variables $X_1$, $X_2$,
$X_3$, .... You should think of $X_i$ as the result of the $i$th repetition of a
particular measurement or experiment. We confine ourselves to the situation
where experimental conditions of subsequent experiments are identical, and the
outcome of any one experiment does not influence the outcomes of others. Under
those circumstances, the random variables of the sequence are independent,
and all have the same distribution, and we therefore call $X_1$, $X_2$, $X_3$, ... an
independent and identically distributed sequence. We shall denote the
distribution function of each random variable $X_i$ by $F$, its expectation by $\mu$,
and the standard deviation by $\sigma$.


¹ We leave the problem of systematic errors aside but will return to it in Chapter 19.


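The average of the first $n$ random variables in the sequence is

$$\bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n},$$

and using linearity of expectations we find:

$$\mathrm{E}[\bar{X}_n] = \frac{1}{n}\,\mathrm{E}[X_1 + X_2 + \cdots + X_n] = \frac{1}{n}(\mu + \mu + \cdots + \mu) = \mu.$$

By the variance-of-the-sum rule, using the independence of $X_1, \ldots, X_n$,

$$\mathrm{Var}(\bar{X}_n) = \frac{1}{n^2}\,\mathrm{Var}(X_1 + \cdots + X_n) = \frac{1}{n^2}(\sigma^2 + \sigma^2 + \cdots + \sigma^2) = \frac{\sigma^2}{n}.$$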


This establishes the following rule. 


EXPECTATION AND VARIANCE OF AN AVERAGE. If $\bar{X}_n$ is the average
of $n$ independent random variables with the same expectation $\mu$ and
variance $\sigma^2$, then

$$\mathrm{E}[\bar{X}_n] = \mu \quad \text{and} \quad \mathrm{Var}(\bar{X}_n) = \frac{\sigma^2}{n}.$$
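
The rule can be checked numerically. The following is a minimal Python sketch, not part of the original text: it assumes Gam(2,1) variables (for which $\mu = 2$ and $\sigma^2 = 2$), uses only the standard library, and the function names `simulate_averages` and `summarize` are illustrative choices.

```python
import random
import statistics

def simulate_averages(n, reps=10000, seed=1):
    """Return `reps` realizations of the average of n Gam(2,1) variables."""
    rng = random.Random(seed)
    return [
        sum(rng.gammavariate(2.0, 1.0) for _ in range(n)) / n
        for _ in range(reps)
    ]

def summarize(n, reps=10000, seed=1):
    """Sample mean and sample variance of the simulated averages."""
    averages = simulate_averages(n, reps, seed)
    return statistics.fmean(averages), statistics.variance(averages)

for n in (1, 4, 16):
    mean, var = summarize(n)
    # Theory for Gam(2,1): expectation 2 and variance 2/n for the average.
    print(f"n={n:3d}  mean={mean:.3f}  variance={var:.3f}  sigma^2/n={2 / n:.3f}")
```

The sample mean should stay near 2 for every $n$, while the sample variance should shrink roughly like $2/n$, up to simulation noise.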


The expectation of $\bar{X}_n$ is again $\mu$, and its standard deviation is less than that
of a single $X_i$ by a factor $\sqrt{n}$; the “typical distance” from $\mu$ is $\sqrt{n}$ smaller.
The latter property is what Michelson used to gain accuracy. To illustrate 
this, we analyze an example. 


Suppose the random variables $X_1$, $X_2$, ... are continuous with a Gam(2,1)
distribution, so with probability density:

$$f(x) = x e^{-x} \quad \text{for } x \ge 0.$$

Recall from Section 11.2 that this means that each $X_i$ is distributed as the
sum of two independent Exp(1) random variables. Hence, $S_n = X_1 + \cdots + X_n$
is distributed as the sum of $2n$ independent Exp(1) random variables, which
has a Gam(2n,1) distribution, with probability density

$$f_{S_n}(x) = \frac{x^{2n-1} e^{-x}}{(2n-1)!} \quad \text{for } x \ge 0.$$




Because $\bar{X}_n = S_n/n$, we find by applying the change-of-units rule (page 106):

$$f_{\bar{X}_n}(x) = n f_{S_n}(nx) = \frac{n (nx)^{2n-1} e^{-nx}}{(2n-1)!} \quad \text{for } x \ge 0.$$


This is the probability density of the Gam(2n,n) distribution. 

So we have determined the distribution of $\bar{X}_n$ explicitly and we can investigate
what happens as $n$ increases, for example, by plotting probability densities.
In the left-hand column of Figure 13.1 you see plots of $f_{\bar{X}_n}$ for $n$ = 1, 2, 4, 9,
16, and 400 (note that for $n = 1$ this is just $f$ itself). For comparison, we take
as a second example a so-called bimodal density function: a density with two
bumps, formally called modes. For the same values of $n$ we determined the
probability density function of $\bar{X}_n$ (unlike the previous example, we are not
concerned with the computations, just with the results). The graphs of these
densities are given side by side with the gamma densities in Figure 13.1.
The graphs clearly show that, as $n$ increases, there is “contraction” of the
probability mass near the expected value $\mu$ (for the gamma densities this is 2,
for the bimodal densities 2.625).


QUICK EXERCISE 13.1 Compare the probabilities that $\bar{X}_n$ is within 0.5 of its
expected value for n = 1, 4, 16, and 400. Do this for the gamma case only 
by estimating the probabilities from the graphs in the left-hand column of 
Figure 13.1. 


13.2 Chebyshev’s inequality 


The contraction of probability mass near the expectation is a consequence of 
the fact that, for any probability distribution, most probability mass is within 
a few standard deviations from the expectation. To show this we will employ 
the following tool, which provides a bound for the probability that the random 
variable $Y$ is outside the interval $(\mathrm{E}[Y] - a, \mathrm{E}[Y] + a)$.


CHEBYSHEV’S INEQUALITY. For an arbitrary random variable $Y$
and any $a > 0$:

$$\mathrm{P}(|Y - \mathrm{E}[Y]| \ge a) \le \frac{1}{a^2}\,\mathrm{Var}(Y).$$


We shall derive this inequality for continuous $Y$ (the discrete case is similar).
Let $f_Y$ be the probability density function of $Y$. Let $\mu$ denote $\mathrm{E}[Y]$. Then:

$$\mathrm{Var}(Y) = \int_{-\infty}^{\infty} (y - \mu)^2 f_Y(y)\,dy \ge \int_{|y - \mu| \ge a} (y - \mu)^2 f_Y(y)\,dy$$

$$\ge \int_{|y - \mu| \ge a} a^2 f_Y(y)\,dy = a^2\,\mathrm{P}(|Y - \mu| \ge a).$$


Fig. 13.1. Densities of averages. Left column: from a gamma density; right column:
from a bimodal density.


Dividing both sides of the resulting inequality by $a^2$, we obtain Chebyshev’s
inequality.
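
As an informal illustration (not part of the original text), the inequality can be compared with relative frequencies from a simulation. The sketch below assumes $Y$ has a Gam(2,1) distribution, so that $\mu = 2$ and $\mathrm{Var}(Y) = 2$; the function name `chebyshev_check` and the sample size are arbitrary choices.

```python
import random

def chebyshev_check(a, n_samples=50000, seed=1):
    """Return (relative frequency of |Y - mu| >= a, Chebyshev bound Var(Y)/a^2)
    for simulated Y ~ Gam(2,1), which has mu = 2 and Var(Y) = 2."""
    rng = random.Random(seed)
    mu, var = 2.0, 2.0
    hits = sum(abs(rng.gammavariate(2.0, 1.0) - mu) >= a for _ in range(n_samples))
    return hits / n_samples, var / a**2

for a in (1.5, 2.0, 3.0):
    freq, bound = chebyshev_check(a)
    print(f"a={a}: relative frequency {freq:.4f}, Chebyshev bound {bound:.4f}")
```

For this distribution the relative frequencies should come out well below the bounds, which is typical: the inequality holds for every distribution and is therefore rarely sharp.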

Denote $\mathrm{Var}(Y)$ by $\sigma^2$ and consider the probability that $Y$ is within a few
standard deviations from its expectation $\mu$:

$$\mathrm{P}(|Y - \mu| < k\sigma) = 1 - \mathrm{P}(|Y - \mu| \ge k\sigma),$$

where $k$ is a small integer. Setting $a = k\sigma$ in Chebyshev’s inequality, we find

$$\mathrm{P}(|Y - \mu| < k\sigma) \ge 1 - \frac{\mathrm{Var}(Y)}{k^2\sigma^2} = 1 - \frac{1}{k^2}. \qquad (13.1)$$

For $k = 2, 3, 4$ the right-hand side is 3/4, 8/9, and 15/16, respectively. This
suggests that with Chebyshev’s inequality we can make very strong
statements. For most distributions, however, the actual value of $\mathrm{P}(|Y - \mu| < k\sigma)$
is even higher than the lower bound (13.1). We summarize this as a somewhat
loose rule.


THE “$\mu \pm$ A FEW $\sigma$” RULE. Most of the probability mass of a
random variable is within a few standard deviations from its expec- 
tation. 


QUICK EXERCISE 13.2 Calculate $\mathrm{P}(|Y - \mu| < k\sigma)$ exactly for $k = 1, 2, 3, 4$
when Y has an Exp(1) distribution and compare this with the bounds from 
Chebyshev’s inequality. 


13.3 The law of large numbers 


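We return to the independent and identically distributed sequence of random
variables $X_1, X_2, \ldots$ with expectation $\mu$ and variance $\sigma^2$. We apply
Chebyshev’s inequality to the average $\bar{X}_n$, where we use $\mathrm{E}[\bar{X}_n] = \mu$ and
$\mathrm{Var}(\bar{X}_n) = \sigma^2/n$, and where $\varepsilon > 0$:

$$\mathrm{P}(|\bar{X}_n - \mu| > \varepsilon) = \mathrm{P}(|\bar{X}_n - \mathrm{E}[\bar{X}_n]| > \varepsilon) \le \frac{1}{\varepsilon^2}\,\mathrm{Var}(\bar{X}_n) = \frac{\sigma^2}{n\varepsilon^2}.$$

The right-hand side vanishes as $n$ goes to infinity, no matter how small $\varepsilon$ is.
This proves the following law.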


THE LAW OF LARGE NUMBERS. If $\bar{X}_n$ is the average of $n$ independent
random variables with expectation $\mu$ and variance $\sigma^2$, then for any
$\varepsilon > 0$:

$$\lim_{n \to \infty} \mathrm{P}(|\bar{X}_n - \mu| > \varepsilon) = 0.$$




A connection with experimental work 


Let us try to interpret the law of large numbers from an experimenter’s per- 
spective. Imagine you conduct a series of experiments. The experimental setup 
is complicated and your measurements vary quite a bit around the “true” value 
you are after. Suppose (unknown to you) your measurements have a gamma 
distribution, and its expectation is what you want to determine. You decide 
to do a certain number of measurements, say n, and to use their average as 
your estimate of the expectation. 


We can simulate all this, and Figure 13.2 shows the results of a simulation,
where we chose the same Gam(2,1) distribution, i.e., with expectation $\mu = 2$.
We anticipated that you might want to do as many as 500 measurements, so
we generated realizations for $X_1, X_2, \ldots, X_{500}$. For each $n$ we computed the
average of the first $n$ values and plotted these averages against $n$ in Figure 13.2.


Fig. 13.2. Averages of realizations of a sequence of gamma distributed random 
variables. 


If your decision is to do 200 repetitions, you would find (in this simulation) a 
value of about 2.09 (slightly too high, but you wouldn’t know!), whereas with 
n = 400 you would be almost exactly correct with 1.99, and with n = 500 
again a little farther away with 2.06. For another sequence of realizations, the 
details in the pattern that you see in Figure 13.2 would be different, but the 
general dampening of the oscillations would still be present. This follows from 
what we saw earlier, that as n is larger, the probability for the average to be 
within a certain distance of the expectation increases, in the limit even to 1. 
In practice it may happen that with a large number of repetitions your average 
is farther from the “true” value than with a smaller number of repetitions—if 
it is, then you had bad luck, because the odds are in your favor. 
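
A simulation of this kind can be sketched in a few lines of Python. This is an illustration under assumptions, not the code behind Figure 13.2: the seed is arbitrary, so the individual values will differ from those quoted above, and the function name `running_averages` is a hypothetical choice.

```python
import random

def running_averages(n_max=500, seed=1):
    """Averages of the first n realizations, n = 1..n_max, of Gam(2,1) variables."""
    rng = random.Random(seed)
    total, averages = 0.0, []
    for n in range(1, n_max + 1):
        total += rng.gammavariate(2.0, 1.0)
        averages.append(total / n)
    return averages

averages = running_averages()
for n in (200, 400, 500):
    print(f"n={n}: average of the first n realizations = {averages[n - 1]:.2f}")
```

Changing the seed produces another sequence of realizations, with a different pattern in the details but the same general dampening of the oscillations.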




The averages may fail to converge 


The law of large numbers is valid if the expectation of the distribution F is 
finite. This is not always the case. For example, the Cauchy and some Pareto 
distributions have heavy tails: their probability densities do go to 0 as $x$
becomes large, but (too) slowly.² On the left in Figure 13.3 you see the result
of a simulation with Cau(2,1) random variables. As in the gamma case, the
averages tend to go toward 2 (which is the point of symmetry of the Cau(2,1)
density), but once in a while a very large (positive or negative) realization of
an $X_i$ throws off the average.


Fig. 13.3. Averages of realizations of a sequence of Cauchy (at left) and Pareto (at
right) distributed random variables.


On the right in Figure 13.3 the result of a simulation with a Par(0.99) distri- 
bution is shown. Its expectation is infinite. In the plot we see segments where 
the average “drifts downward,” separated by upward jumps, which correspond 
to $X_i$ with extremely large values. The effect of the jumps dominates: it can
be shown that $\bar{X}_n$ grows beyond any level.


You might think that these patterns are phenomena that occur because of 
the short length of the simulation and that in longer simulations they would 
disappear after some value of n. However, the patterns as described will con- 
tinue to occur and the results of a longer simulation, say to n = 5000, would 
not look any “better.” 
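
The behavior described here can be explored with a small simulation. The following Python sketch is an illustration, not the code used for Figure 13.3. It assumes that Cau(2,1) realizations are generated by inverting the distribution function and that Par(0.99) corresponds to the standard library's `random.paretovariate(0.99)`, which returns values of at least 1; the helper names are hypothetical.

```python
import math
import random

def cauchy(rng, location=2.0, scale=1.0):
    """One Cau(location, scale) realization by inverting the distribution function."""
    return location + scale * math.tan(math.pi * (rng.random() - 0.5))

def running_averages(draw, n_max, seed=1):
    """Averages of the first n realizations, n = 1..n_max, produced by draw(rng)."""
    rng = random.Random(seed)
    total, averages = 0.0, []
    for n in range(1, n_max + 1):
        total += draw(rng)
        averages.append(total / n)
    return averages

cauchy_avg = running_averages(cauchy, 5000)
pareto_avg = running_averages(lambda rng: rng.paretovariate(0.99), 5000)
for n in (500, 5000):
    print(f"n={n}: Cauchy average {cauchy_avg[n - 1]:.2f}, Pareto average {pareto_avg[n - 1]:.2f}")
```

Running this with several seeds gives an impression of how erratic the averages remain, even for the larger value of $n$.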


² They represent two separate cases: the Cauchy expectation does not exist (see
Remark 7.1) and the Par($\alpha$)’s expectation is $+\infty$ if $\alpha \le 1$ (see Section 7.2).


Remark 13.1 (There is a stronger law of large numbers). Even
though it is a strong statement, the law of large numbers in this paragraph
is more accurately known as the weak law of large numbers. A stronger
result holds, the strong law of large numbers, which says that:

$$\mathrm{P}\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1.$$

This is also expressed as “as $n$ goes to infinity, $\bar{X}_n$ converges to $\mu$ with
probability 1.” It is not easy to see, but it is true that the strong law is
actually stronger. The conditions for the law of large numbers, as stated
in this section, could be relaxed. They suffice for both versions of the law.
The conditions can be weakened to a point where the weak law still follows
from them, but the strong law does not anymore; the strong law requires
the stronger conditions.


13.4 Consequences of the law of large numbers 


We continue with the sequence $X_1, X_2, \ldots$ of independent random variables
with distribution function $F$. In the previous section we saw how we could
recover the (unknown) expectation $\mu$ from a realization of the sequence. We
shall see that in fact we can recover any feature of the probability
distribution. In order to avoid unnecessary indices, as in $\mathrm{E}[X_1]$ and $\mathrm{P}(X_1 \in C)$, we
introduce an additional random variable $X$ that also has $F$ as its distribution
function.


Recovering the probability of an event 


Suppose that, rather than being interested in $\mu = \mathrm{E}[X]$, we want to know the
probability of an event, for example,

$$p = \mathrm{P}(X \in C), \quad \text{where } C = (a, b] \text{ for some } a < b.$$

If you do not know this probability $p$, you would probably estimate it from
how often the event $\{X_i \in C\}$ occurs in the sequence. You would use the
relative frequency of $X_i \in C$ among $X_1, \ldots, X_n$: the number of times the
set $C$ was hit divided by $n$. Define for each $i$:

$$Y_i = \begin{cases} 1 & \text{if } X_i \in C, \\ 0 & \text{if } X_i \notin C. \end{cases}$$


The random variable $Y_i$ indicates whether the corresponding $X_i$ hits the set $C$;
it is called an indicator random variable. In general, an indicator random
variable for an event $A$ is a random variable that is 1 when $A$ occurs and 0
when $A^c$ occurs. Using this terminology, $Y_i$ is the indicator random variable
of the event $X_i \in C$. Its expectation is given by

$$\mathrm{E}[Y_i] = 1 \cdot \mathrm{P}(X_i \in C) + 0 \cdot \mathrm{P}(X_i \notin C) = \mathrm{P}(X_i \in C) = \mathrm{P}(X \in C) = p.$$

Using the $Y_i$, the relative frequency is expressed as $(Y_1 + Y_2 + \cdots + Y_n)/n = \bar{Y}_n$.
Note that the random variables $Y_1, Y_2, \ldots$ are independent; the $X_i$ form an
independent sequence, and $Y_i$ is determined from $X_i$ only (this is an application
of the rule about propagation of independence; see page 126).




The law of large numbers, with $p$ in the role of $\mu$, can now be applied to $\bar{Y}_n$;
it is the average of $n$ independent random variables with expectation $p$ and
variance $p(1 - p)$, so

$$\lim_{n \to \infty} \mathrm{P}(|\bar{Y}_n - p| > \varepsilon) = 0 \qquad (13.2)$$

for any $\varepsilon > 0$. By reasoning along the same lines as in the previous section, we
see that from a long sequence of realizations we can get an accurate estimate 
of the probability p. 
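
To make this concrete, here is a minimal Python sketch (an illustration, not part of the original text) that estimates $p = \mathrm{P}(X \in C)$ by a relative frequency. It assumes Gam(2,1) realizations and $C = (1.75, 2.25]$, and it uses the fact that the Gam(2,1) distribution function is $F(x) = 1 - (1 + x)e^{-x}$ for $x \ge 0$; the function names are hypothetical.

```python
import math
import random

def relative_frequency(a, b, n, seed=1):
    """Relative frequency of the event X_i in (a, b] among n Gam(2,1) realizations."""
    rng = random.Random(seed)
    # Each term is a realization of the indicator Y_i; their average is Y-bar_n.
    return sum(a < rng.gammavariate(2.0, 1.0) <= b for _ in range(n)) / n

def exact_probability(a, b):
    """P(a < X <= b) for X ~ Gam(2,1), using F(x) = 1 - (1 + x) * exp(-x)."""
    def F(x):
        return 1.0 - (1.0 + x) * math.exp(-x)
    return F(b) - F(a)

p = exact_probability(1.75, 2.25)
for n in (100, 1000, 10000):
    freq = relative_frequency(1.75, 2.25, n)
    print(f"n={n:6d}  relative frequency={freq:.4f}  p={p:.4f}")
```

For larger $n$ the relative frequency should typically be closer to $p$, in line with (13.2).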


Recovering the probability density function 


Consider the continuous case, where $f$ is the probability density function
corresponding with $F$, and now choose $C = (a - h, a + h]$, for some (small)
positive $h$. By equation (13.2), for large $n$:

$$\bar{Y}_n \approx p = \mathrm{P}(X \in C) = \int_{a-h}^{a+h} f(x)\,dx \approx 2h f(a). \qquad (13.3)$$

This relationship suggests estimating the probability density at $a$ as follows:

$$f(a) \approx \frac{\bar{Y}_n}{2h} = \frac{\text{the number of times } X_i \in C \text{ for } i \le n}{n \cdot \text{the length of } C}.$$

In Figure 13.4 we have done so for $h = 0.25$ and two values of $a$: 2 and 4.
Rather than plotting the estimate in just one point, we use the same value
for the whole interval $(a - h, a + h]$. This results in a vertical bar, whose area
corresponds to $\bar{Y}_n$:

$$\text{height} \cdot \text{width} = \frac{\bar{Y}_n}{2h} \cdot 2h = \bar{Y}_n.$$


These estimates are based on the realizations of 500 independent Gam(2,1)
distributed random variables. In order to be able to see how well things came
out, the Gam(2,1) density function is shown as well; near $a = 2$ the estimate
is very accurate, but around $a = 4$ it is a little too low.


Fig. 13.4. Estimating the density at two points.


There really is no reason to derive estimated values around just a few points, 
as is done in Figure 13.4. We might as well cover the whole x-axis with a grid 
(with grid size 2h) and do the computation for each point in the grid, thus 
covering the axis with a series of bars. The resulting bar graph is called a 
histogram. Figure 13.5 shows the result for two sets of realizations. 


Fig. 13.5. Recovering the density function by way of histograms.


The top graph is constructed from the same realizations as Figure 13.4 and 
the bottom graph is constructed from a new set of realizations. Both graphs 
match the general shape of the density, with some bumps and valleys that are 
particular for the corresponding set of realizations. In Chapters 15 and 17 we 
shall return to histograms and treat them more elaborately. 
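
The histogram construction can be sketched as follows. This Python fragment is an illustration under assumptions, not the code behind Figures 13.4 and 13.5: the grid points are taken to be $a = 0, 2h, 4h, \ldots$ so that the bar around $a = 2$ covers $(1.75, 2.25]$, and the function name `histogram_heights` is hypothetical.

```python
import math
import random

def histogram_heights(data, h, n_bars):
    """Heights Y-bar_n / (2h) of the bars on the cells (a - h, a + h],
    for the grid points a = 0, 2h, 4h, ..., 2h * (n_bars - 1)."""
    counts = [0] * n_bars
    for x in data:
        k = math.ceil((x - h) / (2 * h))  # index of the cell (2hk - h, 2hk + h] containing x
        if 0 <= k < n_bars:
            counts[k] += 1
    return [count / (len(data) * 2 * h) for count in counts]

rng = random.Random(1)
data = [rng.gammavariate(2.0, 1.0) for _ in range(500)]
heights = histogram_heights(data, h=0.25, n_bars=21)
for k in (4, 8):  # the bars around a = 2 and a = 4
    a = 2 * 0.25 * k
    print(f"a={a}: bar height {heights[k]:.3f}, Gam(2,1) density {a * math.exp(-a):.3f}")
```

Each bar has area equal to the relative frequency of its cell, so the areas of all bars add up to (at most) 1.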


QUICK EXERCISE 13.3 The height of the bar at $x = 2$ in the first histogram
is 0.26. How many of the 500 realizations were between 1.75 and 2.25? 




13.5 Solutions to the quick exercises 


13.1 The answers you have found should be in the neighborhood of the fol- 
lowing exact values: 


$n$                                           1      4      16     400
$\mathrm{P}(|\bar{X}_n - \mu| < 0.5)$         0.27   0.52   0.85   1.00
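
These values can be reproduced numerically from the density of $\bar{X}_n$ derived in Section 13.1. The Python sketch below is one possible way to do so, not the authors' computation: it evaluates the Gam(2n,n) density through logarithms to avoid overflow and integrates it with a simple midpoint rule; the function names and the number of integration steps are assumptions.

```python
import math

def average_density(x, n):
    """Density of the average of n Gam(2,1) variables, i.e. of Gam(2n, n):
    n * (n x)^(2n-1) * exp(-n x) / (2n-1)!, evaluated via logarithms."""
    if x <= 0:
        return 0.0
    log_f = math.log(n) + (2 * n - 1) * math.log(n * x) - n * x - math.lgamma(2 * n)
    return math.exp(log_f)

def prob_within(n, delta=0.5, mu=2.0, steps=20000):
    """Midpoint-rule approximation of P(|average - mu| < delta)."""
    lo, hi = mu - delta, mu + delta
    width = (hi - lo) / steps
    return width * sum(average_density(lo + (i + 0.5) * width, n) for i in range(steps))

for n in (1, 4, 16, 400):
    print(f"n={n:3d}  P(|average - 2| < 0.5) is approximately {prob_within(n):.2f}")
```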


13.2 Because $Y$ has an Exp(1) distribution, $\mu = 1$ and $\mathrm{Var}(Y) = \sigma^2 = 1$; we
find for $k \ge 1$:

$$\mathrm{P}(|Y - \mu| < k\sigma) = \mathrm{P}(|Y - 1| < k) = \mathrm{P}(1 - k < Y < k + 1) = \mathrm{P}(Y < k + 1) = 1 - e^{-k-1}.$$

Using this formula and (13.1) we obtain the following numbers:


$k$                              1       2       3       4
Lower bound from Chebyshev       0       0.750   0.889   0.938
$\mathrm{P}(|Y - 1| < k)$        0.865   0.950   0.982   0.993
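
The table can be recomputed directly from the two formulas. A short Python sketch, with illustrative function names that are not part of the original text:

```python
import math

def chebyshev_lower_bound(k):
    """Lower bound 1 - 1/k^2 from (13.1) for P(|Y - mu| < k sigma)."""
    return 1.0 - 1.0 / k**2

def exact_exp1(k):
    """P(|Y - 1| < k) = 1 - exp(-(k + 1)) for Y ~ Exp(1) and k >= 1."""
    return 1.0 - math.exp(-(k + 1))

for k in (1, 2, 3, 4):
    print(f"k={k}  lower bound={chebyshev_lower_bound(k):.3f}  exact={exact_exp1(k):.3f}")
```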


13.3 The value of $\bar{Y}_n$ for this bar equals its area $0.26 \cdot 0.5 = 0.13$. The bar
represents 13% of the values, or $0.13 \cdot 500 = 65$ realizations.


13.6 Exercises 


13.1 Verify the “$\mu \pm$ a few $\sigma$” rule as you did in Quick exercise 13.2 for the
following distributions: $U(-1, 1)$, $U(-a, a)$, $N(0, 1)$, $N(\mu, \sigma^2)$, Par(3), Geo(1/2).
Construct a table as in the answer to the quick exercise and enter a line for 
each distribution. 


13.2 An accountant wants to simplify his bookkeeping by rounding amounts
to the nearest integer, for example, rounding € 99.53 and € 100.46 both to
€ 100. What is the cumulative effect of this if there are, say, 100 amounts? To
study this we model the rounding errors by 100 independent $U(-0.5, 0.5)$
random variables $X_1, X_2, \ldots, X_{100}$.


a. Compute the expectation and the variance of the $X_i$.

b. Use Chebyshev’s inequality to compute an upper bound for the probability
$\mathrm{P}(|X_1 + X_2 + \cdots + X_{100}| > 10)$ that the cumulative rounding error
$X_1 + X_2 + \cdots + X_{100}$ exceeds € 10.




13.3 Consider the situation of the previous exercise. A manager wants to
know what happens to the mean absolute error $\frac{1}{n}\sum_{i=1}^{n} |X_i|$ as $n$ becomes
large. What can you say about this, applying the law of large numbers?


13.4 Of the voters in Florida, a proportion $p$ will vote for candidate G,
and a proportion $1 - p$ will vote for candidate B. In an election poll a number
of voters are asked for whom they will vote. Let $X_i$ be the indicator random
variable for the event “the $i$th person interviewed will vote for G.” A model
for the election poll is that the people to be interviewed are selected in such
a way that the indicator random variables $X_1, X_2, \ldots$ are independent and
have a Ber($p$) distribution.


a. Suppose we use $\bar{X}_n$ to predict $p$. According to Chebyshev’s inequality, how
large should $n$ be (how many people should be interviewed) such that the
probability that $\bar{X}_n$ is within 0.2 of the “true” $p$ is at least 0.9?
Hint: solve this first for $p = 1/2$, and use that $p(1 - p) \le 1/4$ for all
$0 \le p \le 1$.

b. Answer the same question, but now $\bar{X}_n$ should be within 0.1 of $p$.


c. Answer the question from part a, but now the probability should be at
least 0.95.


d. If $p > 1/2$ candidate G wins; if $\bar{X}_n > 1/2$ you predict that G will win.
Find an $n$ (as small as you can) such that the probability that you predict
correctly is at least 0.9, if in fact $p = 0.6$.


13.5 You are trying to determine the melting point of a new material, of 
which you have a large number of samples. For each sample that you measure 
you find a value close to the actual melting point c but corrupted with a 
measurement error. We model this with random variables: 


$$M_i = c + U_i,$$


where $M_i$ is the measured value in degree Kelvin, and $U_i$ is the occurring
random error. It is known that $\mathrm{E}[U_i] = 0$ and $\mathrm{Var}(U_i) = 3$, for each $i$, and that
we may consider the random variables $M_1, M_2, \ldots$ independent. According
to Chebyshev’s inequality, how many samples do you need to measure to be 
90% sure that the average of the measurements is within half a degree of c? 


13.6 The casino La bella Fortuna is for sale and you think you might want
to buy it, but you want to know how much money you are going to make. All 
the present owner can tell you is that the roulette game Red or Black is played 
about 1000 times a night, 365 days a year. Each time it is played you have 
probability 19/37 of winning the player’s bet of € 1 and probability 18/37 of 
having to pay the player €1. 


Explain in detail why the law of large numbers can be used to determine the 
income of the casino, and determine how much it is. 




13.7 Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed
random variables with distribution function $F$. Define $F_n$ as follows: for any $a$

$$F_n(a) = \frac{\text{number of } X_i \text{ in } (-\infty, a]}{n}.$$

Consider $a$ fixed and introduce the appropriate indicator random variables (as
in Section 13.4). Compute their expectation and variance and show that the
law of large numbers tells us that

$$\lim_{n \to \infty} \mathrm{P}(|F_n(a) - F(a)| > \varepsilon) = 0.$$


13.8 In Section 13.4 we described how the probability density function
could be recovered from a sequence $X_1, X_2, X_3, \ldots$. We consider the
Gam(2,1) probability density discussed in the main text and a histogram bar
around the point $a = 2$. Then $f(a) = f(2) = 2e^{-2} = 0.27$ and the estimate
for $f(2)$ is $\bar{Y}_n/2h$, with $\bar{Y}_n$ as in (13.3).


a. Express the standard deviation of $\bar{Y}_n/2h$ in terms of $n$ and $h$.

b. Choose h = 0.25. How large should n be (according to Chebyshev’s in- 
equality) so that the estimate is within 20% of the “true value”, with 
probability 80%? 


13.9 Let $X_1, X_2, \ldots$ be an independent sequence of $U(-1, 1)$ random
variables and let $T_n = \frac{1}{n}\sum_{i=1}^{n} X_i^2$. It is claimed that for some $a$ and any
$\varepsilon > 0$

$$\lim_{n \to \infty} \mathrm{P}(|T_n - a| > \varepsilon) = 0.$$


a. Explain how this could be true. 
b. Determine a. 


13.10 Let $M_n$ be the maximum of $n$ independent $U(0, 1)$ random variables.


a. Derive the exact expression for $\mathrm{P}(|M_n - 1| > \varepsilon)$.
Hint: see Section 8.4.


b. Show that $\lim_{n \to \infty} \mathrm{P}(|M_n - 1| > \varepsilon) = 0$. Can this be derived from
Chebyshev’s inequality or the law of large numbers?


13.11 For some $t > 1$, let $X$ be a random variable taking the values 0 and $t$,
with probabilities

$$\mathrm{P}(X = 0) = 1 - \frac{1}{t} \quad \text{and} \quad \mathrm{P}(X = t) = \frac{1}{t}.$$

Then $\mathrm{E}[X] = 1$ and $\mathrm{Var}(X) = t - 1$. Consider the probability $\mathrm{P}(|X - 1| > a)$.


a. Verify the following: if $t = 10$ and $a = 8$ then $\mathrm{P}(|X - 1| > a) = 1/10$ and
Chebyshev’s inequality gives an upper bound for this probability of 9/64.
The difference is $9/64 - 1/10 \approx 0.04$. We will say that for $t = 10$ the
Chebyshev gap for $X$ at $a = 8$ is 0.04.




b. Compute the Chebyshev gap for $t = 10$ at $a = 5$ and at $a = 10$.
c. Can you find a gap smaller than 0.01, smaller than 0.001, smaller than 
0.0001? 


d. Do you think one could improve Chebyshev’s inequality, i.e., find an upper 
bound closer to the true probabilities? 


13.12 (A more general law of large numbers). Let $X_1, X_2, \ldots$ be a
sequence of independent random variables, with $\mathrm{E}[X_i] = \mu_i$ and $\mathrm{Var}(X_i) = \sigma_i^2$,
for $i = 1, 2, \ldots$. Suppose that $0 < \sigma_i^2 \le M$, for all $i$. Let $a$ be an arbitrary
positive number.


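a. Apply Chebyshev’s inequality to show that

$$\mathrm{P}\left(\left|\bar{X}_n - \frac{1}{n}\sum_{i=1}^{n} \mu_i\right| > a\right) \le \frac{\mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_n)}{n^2 a^2}.$$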


b. Conclude from a that

$$\lim_{n \to \infty} \mathrm{P}\left(\left|\bar{X}_n - \frac{1}{n}\sum_{i=1}^{n} \mu_i\right| > a\right) = 0.$$


Check that the law of large numbers is a special case of this result. 


14 


The central limit theorem 


The central limit theorem is a refinement of the law of large numbers.
For a large number of independent identically distributed random variables
$X_1, \ldots, X_n$, with finite variance, the average $\bar{X}_n$ approximately has a normal
distribution, no matter what the distribution of the $X_i$ is. In the first section
we discuss the proper normalization of $\bar{X}_n$ to obtain a normal distribution
in the limit. In the second section we will use the central limit theorem to
approximate probabilities of averages and sums of random variables.


14.1 Standardizing averages 


In the previous chapter we saw that the law of large numbers guarantees
the convergence to $\mu$ of the average $\bar{X}_n$ of $n$ independent random variables
$X_1, \ldots, X_n$, all having the same expectation $\mu$ and variance $\sigma^2$. This
convergence was illustrated by Figure 13.1. Closer examination of this figure suggests
another phenomenon: for the two distributions considered (i.e., the Gam(2,1)
distribution and a bimodal distribution), the probability density function of
$\bar{X}_n$ seems to become symmetrical and bell shaped around the expected value $\mu$
as $n$ becomes larger and larger. However, the bell collapses into a single spike
at $\mu$. Nevertheless, by a proper normalization it is possible to stabilize the
bell shape, as we will see.


In order to let the distribution of $\bar{X}_n$ settle down it seems to be a good idea
to stabilize the expectation and variance. Since $\mathrm{E}[\bar{X}_n] = \mu$ for all $n$, only the
variance needs some special attention. In Figure 14.1 we depict the probability
density function of the centered average $\bar{X}_n - \mu$ of Gam(2,1) random variables,
multiplied by three different powers of $n$. In the left column we display the
density of $n^{1/4}(\bar{X}_n - \mu)$, in the middle column the density of $n^{1/2}(\bar{X}_n - \mu)$, and
in the right column the density of $n(\bar{X}_n - \mu)$. These figures suggest that $\sqrt{n}$
is the right factor to stabilize the bell shape.


Fig. 14.1. Multiplying the difference $\bar{X}_n - \mu$ of $n$ Gam(2,1) random variables. Left
column: $n^{1/4}(\bar{X}_n - \mu)$; middle column: $\sqrt{n}(\bar{X}_n - \mu)$; right column: $n(\bar{X}_n - \mu)$.


Indeed, according to the rule for the variance of an average (see page 182),
we have $\mathrm{Var}(\bar{X}_n) = \sigma^2/n$, and therefore for any number $C$:

$$\mathrm{Var}(C(\bar{X}_n - \mu)) = \mathrm{Var}(C\bar{X}_n) = C^2\,\mathrm{Var}(\bar{X}_n) = C^2\,\frac{\sigma^2}{n}.$$

To stabilize the variance we therefore must choose $C = \sqrt{n}$. In fact, by
choosing $C = \sqrt{n}/\sigma$, one standardizes the averages, i.e., the resulting random
variable $Z_n$, defined by

$$Z_n = \sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma}, \qquad n = 1, 2, \ldots,$$

has expected value 0 and variance 1. What more can we say about the
distribution of the random variables $Z_n$?


In case $X_1, X_2, \ldots$ are independent $N(\mu, \sigma^2)$ distributed random variables,
we know from Section 11.2 and the rule on expectation and variance under
change of units (see page 98), that $Z_n$ has an $N(0, 1)$ distribution for all $n$. For
the gamma and bimodal random variables from Section 13.1 we depicted the
probability density function of $Z_n$ in Figure 14.2. For both examples we see
that the probability density functions of the $Z_n$ seem to converge to the
probability density function of the $N(0, 1)$ distribution, indicated by the dotted
line. The following amazing result states that this behavior generally occurs
no matter what distribution we start with.


THE CENTRAL LIMIT THEOREM. Let $X_1, X_2, \ldots$ be any sequence
of independent identically distributed random variables with finite
positive variance. Let $\mu$ be the expected value and $\sigma^2$ the variance
of each of the $X_i$. For $n \ge 1$, let $Z_n$ be defined by

$$Z_n = \sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma};$$

then for any number $a$

$$\lim_{n \to \infty} F_{Z_n}(a) = \Phi(a),$$

where $\Phi$ is the distribution function of the $N(0, 1)$ distribution. In
words: the distribution function of $Z_n$ converges to the distribution
function $\Phi$ of the standard normal distribution.


Note that

$$Z_n = \frac{\bar{X}_n - \mathrm{E}[\bar{X}_n]}{\sqrt{\mathrm{Var}(\bar{X}_n)}},$$

which is a more direct way to see that $Z_n$ is the average $\bar{X}_n$ standardized.
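
The convergence stated in the theorem can be illustrated by simulation. The following Python sketch is not part of the original text: it assumes Gam(2,1) variables ($\mu = 2$, $\sigma^2 = 2$), estimates $F_{Z_n}(a)$ at the single point $a = 1$ by a relative frequency, and compares it with $\Phi(1)$ computed with `statistics.NormalDist`; the function names and the number of repetitions are arbitrary choices.

```python
import math
import random
from statistics import NormalDist

def standardized_average(rng, n, mu=2.0, sigma=math.sqrt(2.0)):
    """One realization of Z_n = sqrt(n) * (average - mu) / sigma for Gam(2,1) variables."""
    average = sum(rng.gammavariate(2.0, 1.0) for _ in range(n)) / n
    return math.sqrt(n) * (average - mu) / sigma

def empirical_cdf_of_z(a, n, reps=4000, seed=1):
    """Fraction of simulated Z_n values that are <= a, an estimate of F_{Z_n}(a)."""
    rng = random.Random(seed)
    return sum(standardized_average(rng, n) <= a for _ in range(reps)) / reps

phi = NormalDist().cdf  # distribution function of the N(0,1) distribution
for n in (1, 4, 50):
    print(f"n={n:2d}  estimate of F_Zn(1) = {empirical_cdf_of_z(1.0, n):.3f}   Phi(1) = {phi(1.0):.3f}")
```

The estimates carry simulation noise, so small differences from $\Phi(1)$ remain even for large $n$.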


Fig. 14.2. Densities of standardized averages $Z_n$. Left column: from a gamma
density; right column: from a bimodal density. Dotted line: $N(0, 1)$ probability density.


One can also write $Z_n$ as a standardized sum

$$Z_n = \frac{X_1 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}. \qquad (14.1)$$

In the next section we will see that this last representation of $Z_n$ is very
helpful when one wants to approximate probabilities of sums of independent
identically distributed random variables.


Since

$$\bar{X}_n = \frac{\sigma}{\sqrt{n}}\,Z_n + \mu,$$

it follows that $\bar{X}_n$ approximately has an $N(\mu, \sigma^2/n)$ distribution; see the
change-of-units rule for normal random variables on page 106. This explains
the symmetrical bell shape of the probability densities in Figure 13.1.


Remark 14.1 (Some history). Originally, the central limit theorem was
proved in 1733 by De Moivre for independent Ber(1/2) distributed random
variables. Laplace extended De Moivre’s result to Ber($p$) random variables
and later formulated the central limit theorem as stated above. Around
1901 a first rigorous proof of this result was given by Lyapunov. Several
versions of the central limit theorem exist with weaker conditions than those
presented here. For example, for applications it is interesting that it is not
necessary that all $X_i$ have the same distribution; see Ross [26], Section 8.3,
or Feller [8], Section 8.4, and Billingsley [3], Section 27.


14.2 Applications of the central limit theorem 


The central limit theorem provides a tool to approximate the probability 
distribution of the average or the sum of independent identically distributed 
random variables. This plays an important role in applications, for instance, 
see Sections 23.4, 24.1, 26.2, and 27.2. Here we will illustrate the use of the 
central limit theorem to approximate probabilities of averages and sums of 
random variables in three examples. The first example deals with an average; 
the other two concern sums of random variables. 


Did we have bad luck? 


In the example in Section 13.3 averages of independent Gam(2,1) distributed
random variables were simulated for $n = 1, \ldots, 500$. In Figure 13.2 the
realization of $\bar{X}_n$ for $n = 400$ is 1.99, which is almost exactly equal to the expected
value 2. For $n = 500$ the simulation was 2.06, a little bit farther away. Did
we have bad luck, or is a value 2.06 or higher not unusual? To answer this
question we want to compute $\mathrm{P}(\bar{X}_n \ge 2.06)$. We will find an approximation
of this probability using the central limit theorem.


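Note that

$$\mathrm{P}(\bar{X}_n \ge 2.06) = \mathrm{P}(\bar{X}_n - \mu \ge 2.06 - \mu)
= \mathrm{P}\left(\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \ge \sqrt{n}\,\frac{2.06 - \mu}{\sigma}\right)
= \mathrm{P}\left(Z_n \ge \sqrt{n}\,\frac{2.06 - \mu}{\sigma}\right).$$

Since the $X_i$ are Gam(2,1) random variables, $\mu = \mathrm{E}[X_i] = 2$ and $\sigma^2 =
\mathrm{Var}(X_i) = 2$. We find for $n = 500$ that

$$\mathrm{P}(\bar{X}_{500} \ge 2.06) = \mathrm{P}\left(Z_{500} \ge \sqrt{500}\,\frac{2.06 - 2}{\sqrt{2}}\right)
= \mathrm{P}(Z_{500} \ge 0.95)
= 1 - \mathrm{P}(Z_{500} < 0.95).$$

It now follows from the central limit theorem that

$$\mathrm{P}(\bar{X}_{500} \ge 2.06) \approx 1 - \Phi(0.95) = 0.1711.$$

This is close to the exact answer 0.1710881, which was obtained using the
probability density of $\bar{X}_n$ as given in Section 13.1.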


Thus we see that there is about a 17% probability that the average $\bar{X}_{500}$ is at
least 0.06 above 2. Since 17% is quite large, we conclude that the value 2.06
is not unusual. In other words, we did not have bad luck; $n = 500$ is simply
not large enough to be that close. Would 2.06 be unusual if $n = 5000$?
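
The computation above can be repeated for any $n$ with a few lines of Python. The sketch below is an illustration, not the authors' code. The exact value is obtained from the standard identity $\mathrm{P}(S \ge s) = \sum_{j=0}^{k-1} e^{-s} s^j / j!$ for a Gam($k$,1) distributed $S$ with integer $k$, which is an assumption of this sketch rather than something derived in the text. Because $z$ is not rounded to 0.95 here, the normal approximation may differ from 0.1711 in the last digits.

```python
import math
from statistics import NormalDist

def clt_tail(threshold, n, mu=2.0, var=2.0):
    """Normal approximation of P(average of n variables >= threshold)."""
    z = math.sqrt(n) * (threshold - mu) / math.sqrt(var)
    return 1.0 - NormalDist().cdf(z)

def exact_gamma_tail(threshold, n):
    """Exact P(average >= threshold) for n Gam(2,1) variables.
    The sum S_n is Gam(2n,1); for integer shape k,
    P(S >= s) = sum_{j=0}^{k-1} exp(-s) * s^j / j!."""
    s, k = n * threshold, 2 * n
    terms = (math.exp(j * math.log(s) - s - math.lgamma(j + 1)) for j in range(k))
    return math.fsum(terms)

print(f"CLT approximation for n = 500: {clt_tail(2.06, 500):.4f}")
print(f"exact value for n = 500:       {exact_gamma_tail(2.06, 500):.4f}")
```

Calling `clt_tail(2.06, 5000)` gives the corresponding approximation for $n = 5000$.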


QUICK EXERCISE 14.1 Show that $\mathrm{P}(\bar{X}_{5000} \ge 2.06) \approx 0.0013$, using the central
limit theorem.


Rounding amounts to the nearest integer 


In Exercise 13.2 an accountant wanted to simplify his bookkeeping by rounding
amounts to the nearest integer, and you were asked to use Chebyshev’s
inequality to compute an upper bound for the probability

$$p = \mathrm{P}(|X_1 + X_2 + \cdots + X_{100}| > 10)$$

that the cumulative rounding error $X_1 + X_2 + \cdots + X_{100}$ exceeds € 10. This
upper bound equals 1/12. In order to know the exact value of $p$ one has to
determine the distribution of the sum $X_1 + \cdots + X_{100}$. This is difficult, but the
central limit theorem is a handy tool to get an approximation of $p$. Clearly,

$$p = \mathrm{P}(X_1 + \cdots + X_{100} < -10) + \mathrm{P}(X_1 + \cdots + X_{100} > 10).$$

Standardizing as in (14.1), for the second probability we write, with $n = 100$




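$$\mathrm{P}(X_1 + \cdots + X_n > 10) = \mathrm{P}(X_1 + \cdots + X_n - n\mu > 10 - n\mu)
= \mathrm{P}\left(\frac{X_1 + \cdots + X_n - n\mu}{\sigma\sqrt{n}} > \frac{10 - n\mu}{\sigma\sqrt{n}}\right)
= \mathrm{P}\left(Z_n > \frac{10 - n\mu}{\sigma\sqrt{n}}\right).$$

The $X_i$ are $U(-0.5, 0.5)$ random variables, $\mu = \mathrm{E}[X_i] = 0$, and $\sigma^2 =
\mathrm{Var}(X_i) = 1/12$, so that

$$\mathrm{P}(X_1 + \cdots + X_{100} > 10) = \mathrm{P}\left(Z_{100} > \frac{10 - 100 \cdot 0}{\sqrt{1/12}\,\sqrt{100}}\right) = \mathrm{P}(Z_{100} > 3.46).$$

It follows from the central limit theorem that

$$\mathrm{P}(Z_{100} > 3.46) \approx 1 - \Phi(3.46) = 0.0003.$$

Similarly,

$$\mathrm{P}(X_1 + \cdots + X_{100} < -10) \approx \Phi(-3.46) = 0.0003.$$

Thus we find that $p \approx 0.0006$.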
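
For comparison, the Chebyshev bound and the normal approximation can be computed side by side. A minimal Python sketch, with variable names chosen for illustration only; since $z$ is not rounded to 3.46 here, the printed approximation may differ slightly from the rounded value in the text.

```python
import math
from statistics import NormalDist

n, mu, var = 100, 0.0, 1.0 / 12.0   # U(-0.5, 0.5): expectation 0, variance 1/12
bound = 10.0

# Chebyshev: P(|S_n - n*mu| >= bound) <= Var(S_n) / bound^2, with Var(S_n) = n * var.
chebyshev = n * var / bound**2

# Central limit theorem: standardize the sum as in (14.1) and use symmetry.
z = (bound - n * mu) / (math.sqrt(var) * math.sqrt(n))
clt = 2.0 * (1.0 - NormalDist().cdf(z))

print(f"Chebyshev upper bound: {chebyshev:.4f}")
print(f"CLT approximation:     {clt:.4f}")
```

The two numbers differ by more than two orders of magnitude, which shows how conservative the Chebyshev bound is in this example.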


Normal approximation of the binomial distribution 


In Section 4.3 we considered the (fictitious) situation that you attend,
completely unprepared, a multiple-choice exam consisting of 10 questions. We saw
that the probability you will pass equals

\[ P(X \ge 6) = 0.0197, \]

where $X$, being the sum of 10 independent $Ber(\tfrac{1}{4})$ random variables, has
a $Bin(10, \tfrac{1}{4})$ distribution. As we saw in Chapter 4 it is rather easy, but
tedious, to calculate $P(X \ge 6)$. Although $n$ is small, we investigate what the
central limit theorem will yield as an approximation of $P(X \ge 6)$. Recall that
a random variable with a $Bin(n, p)$ distribution can be written as the sum of
$n$ independent $Ber(p)$ distributed random variables $R_1, \ldots, R_n$. Substituting
$n = 10$, $\mu = p = 1/4$, and $\sigma^2 = p(1 - p) = 3/16$, it follows from the central
limit theorem that


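\[
\begin{aligned}
P(X \ge 6) &= P(R_1 + \cdots + R_n \ge 6) \\
&= P\left( \frac{R_1 + \cdots + R_n - n\mu}{\sigma\sqrt{n}} \ge \frac{6 - n\mu}{\sigma\sqrt{n}} \right) \\
&= P\left( Z_{10} \ge \frac{6 - 2\tfrac{1}{2}}{\tfrac{1}{4}\sqrt{3}\,\sqrt{10}} \right) \\
&\approx 1 - \Phi(2.56) = 0.0052.
\end{aligned}
\]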




The number 0.0052 is quite a poor approximation for the true value 0.0197.
Note however, that we could also argue that

\[
\begin{aligned}
P(X \ge 6) &= P(X > 5) \\
&= P(R_1 + \cdots + R_n > 5) \\
&= P\left( Z_{10} > \frac{5 - 2\tfrac{1}{2}}{\tfrac{1}{4}\sqrt{3}\,\sqrt{10}} \right) \\
&\approx 1 - \Phi(1.83) = 0.0336,
\end{aligned}
\]

which gives an approximation that is too large! A better approach lies
somewhere in the middle, as the following quick exercise illustrates.


QUICK EXERCISE 14.2 Apply the central limit theorem to find 0.0143 as an
approximation to $P(X \ge 5\tfrac{1}{2})$. Since $P(X \ge 6) = P(X \ge 5\tfrac{1}{2})$, this also provides
an approximation of $P(X \ge 6)$.
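The three normal approximations discussed here (thresholds 6, 5, and the midpoint 5½) can be placed next to the exact binomial tail. This is a hedged sketch using only the Python standard library; the function names `exact_tail` and `normal_tail` are ours and do not come from the text.

```python
import math

def phi(z: float) -> float:
    """Standard normal distribution function, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, p = 10, 0.25
mu = p                              # expectation of a single Ber(p) variable
sigma = math.sqrt(p * (1 - p))      # its standard deviation

def exact_tail(k: int) -> float:
    """Exact P(X >= k) for X with a Bin(n, p) distribution."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def normal_tail(a: float) -> float:
    """Central limit theorem approximation of the probability of exceeding a."""
    z = (a - n * mu) / (sigma * math.sqrt(n))
    return 1 - phi(z)

exact = exact_tail(6)
print(f"exact P(X >= 6): {exact:.4f}")
for a in (6, 5, 5.5):
    print(f"normal approximation with threshold {a}: {normal_tail(a):.4f}")
```

Using the midpoint between 5 and 6 is often called a continuity correction; that name is not used in the text itself.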


How large should n be? 


In view of the previous examples one might raise the question of how large n 
should be to have a good approximation when using the central limit theorem. 
In other words, how fast is the convergence to the normal distribution? This 
is a difficult question to answer in general. For instance, in the third example 
one might initially be tempted to think that the approximation was quite 
poor, but after taking the fact into account that we approximate a discrete 
distribution by a continuous one we obtain a considerable improvement of the 
approximation, as was illustrated in Quick exercise 14.2. For another example, 
see Figure 14.2. Here we see that the convergence is slightly faster for the 
bimodal distribution than for the Gam(2,1) distribution, which is due to the 
fact that the Gam(2,1) is rather asymmetric. 


In general the approximation might be poor when $n$ is small, when the
distribution of the $X_i$ is asymmetric, bimodal, or discrete, or when the value $a$
in

\[ P(\bar{X}_n > a) \]

is far from the center of the distribution of the $X_i$.


14.3 Solutions to the quick exercises 


14.1 In the same way we approximated $P(\bar{X}_n > 2.06)$ using the central limit
theorem, we have that

\[ P(\bar{X}_n > 2.06) = P\left( Z_n > \sqrt{n}\,\frac{2.06 - \mu}{\sigma} \right). \]

With $\mu = 2$ and $\sigma = \sqrt{2}$, we find for $n = 5000$ that

\[ P(\bar{X}_{5000} > 2.06) = P(Z_{5000} > 3), \]

which is approximately equal to $1 - \Phi(3) = 0.0013$, thanks to the central limit
theorem. Because we think that 0.13% is a small probability, to find 2.06 as
a value for $\bar{X}_{5000}$ would mean that you really had bad luck!


14.2 Similar to the computation of $P(X \ge 6)$, we have

\[
P(X \ge 5\tfrac{1}{2}) = P\left( Z_{10} \ge \frac{5\tfrac{1}{2} - 2\tfrac{1}{2}}{\tfrac{1}{4}\sqrt{3}\,\sqrt{10}} \right)
\approx 1 - \Phi(2.19) = 0.0143.
\]

We have seen that using the central limit theorem to approximate $P(X \ge 6)$
gives an underestimate of this probability, while applying the central limit
theorem to $P(X > 5)$ gives an overestimate. Since $5\tfrac{1}{2}$ is “in the middle,” the
approximation will be better.


14.4 Exercises 


14.1 Let $X_1, X_2, \ldots, X_{144}$ be independent identically distributed random
variables, each with expected value $\mu = E[X_i] = 2$, and variance
$\sigma^2 = \mathrm{Var}(X_i) = 4$. Approximate $P(X_1 + X_2 + \cdots + X_{144} > 144)$, using the central
limit theorem.


14.2 ⊡ Let $X_1, X_2, \ldots, X_{625}$ be independent identically distributed random
variables, with probability density function $f$ given by

\[
f(x) = \begin{cases} 3(1 - x)^2 & \text{for } 0 \le x \le 1, \\ 0 & \text{otherwise.} \end{cases}
\]

Use the central limit theorem to approximate $P(X_1 + X_2 + \cdots + X_{625} < 170)$.


14.3 ⊞ In Exercise 13.4 a you were asked to use Chebyshev’s inequality to
determine how large $n$ should be (how many people should be interviewed) so
that the probability that $\bar{X}_n$ is within 0.2 of the “true” $p$ is at least 0.9. Here
$p$ is the proportion of the voters in Florida who will vote for G (and $1 - p$ is
the proportion of the voters who will vote for B). How large should $n$ at least
be according to the central limit theorem?




14.4 ⊡ In the single-server queue model from Section 6.4, $T_i$ is the time
between the arrival of the $(i - 1)$th and $i$th customers. Furthermore, one
of the model assumptions is that the $T_i$ are independent, $Exp(0.5)$
distributed random variables. In Section 11.2 we saw that the probability
$P(T_1 + \cdots + T_{30} \le 60)$ of the 30th customer arriving within an hour at the
well is equal to 0.542. Find the normal approximation of this probability.


14.5 ⊞ Let $X$ be a $Bin(n, p)$ distributed random variable. Show that the
random variable

\[ \frac{X - np}{\sqrt{np(1 - p)}} \]

has a distribution that is approximately standard normal.


14.6 ⊡ Again, as in the previous exercise, let $X$ be a $Bin(n, p)$ distributed
random variable.

a. An exact computation yields that $P(X \le 25) = 0.55347$, when $n = 100$
and $p = 1/4$. Use the central limit theorem to give an approximation of
$P(X \le 25)$ and $P(X < 26)$.

b. When $n = 100$ and $p = 1/4$, then $P(X \le 2) = 1.87 \cdot 10^{-10}$. Use the central
limit theorem to give an approximation of this probability.


14.7 Let $X_1, X_2, \ldots, X_n$ be $n$ independent random variables, each with
expected value $\mu$ and finite positive variance $\sigma^2$. Use Chebyshev’s inequality to
show that for any $a > 0$ one has

\[
P\left( \left| n^{1/4}\, \frac{\bar{X}_n - \mu}{\sigma} \right| \ge a \right) \le \frac{1}{a^2 \sqrt{n}}.
\]

Use this fact to explain the occurrence of a single spike in the left column of
Figure 14.1.


14.8 Let $X_1, X_2, \ldots$ be a sequence of independent $N(0, 1)$ distributed random
variables. For $n = 1, 2, \ldots$, let $Y_n$ be the random variable, defined by

\[ Y_n = X_1^2 + \cdots + X_n^2. \]

a. Show that $E[X_i^2] = 1$.

b. One can show, using integration by parts, that $E[X_i^4] = 3$. Deduce from
this that $\mathrm{Var}(X_i^2) = 2$.

c. Use the central limit theorem to approximate $P(Y_{100} > 110)$.


14.9 ⊡ A factory produces links for heavy metal chains. The research lab of
the factory models the length (in cm) of a link by the random variable $X$,
with expected value $E[X] = 5$ and variance $\mathrm{Var}(X) = 0.04$. The length of a
link is defined in such a way that the length of a chain is equal to the sum of
the lengths of its links. The factory sells chains of 50 meters; to be on the safe
side 1002 links are used for such chains. The factory guarantees that the chain
is not shorter than 50 meters. If by chance a chain is too short, the customer
is reimbursed, and a new chain is given for free.

a. Give an estimate of the probability that for a chain of at least 50 meters
more than 1002 links are needed. For what percentage of the chains does
the factory have to reimburse clients and provide free chains?

b. The sales department of the factory notices that it has to hand out a
lot of free chains and asks the research lab what is wrong. After further
investigations the research lab reports to the sales department that the
expectation value 5 is incorrect, and that the correct value is 4.99 (cm).
Do you think that it was necessary to report such a minor change of this
value?


14.10 Chebyshev’s inequality was used in Exercise 13.5 to determine how
many times $n$ one needs to measure a sample to be 90% sure that the average
of the measurements is within half a degree of the actual melting point $c$ of a
new material.

a. Use the normal approximation to find a less conservative value for $n$.

b. Only in case the random errors $U_i$ in the measurements have a normal
distribution is the value of $n$ from a “exact”; in all other cases it is an
approximation. Explain this.


15 


Exploratory data analysis: graphical summaries 


In the previous chapters we focused on probability models to describe random 
phenomena. Confronted with a new phenomenon, we want to learn about the 
randomness that is associated with it. It is common to conduct an experiment 
for this purpose and record observations concerning the phenomenon. The set 
of observations is called a dataset. By exploring the dataset we can gain insight 
into what probability model suits the phenomenon. 


Frequently you will have to deal with a dataset that contains so many ele- 
ments that it is necessary to condense the data for easy visual comprehension 
of general characteristics. In this chapter we present several graphical methods 
to do so. To graphically represent univariate datasets, consisting of repeated 
measurements of one particular quantity, we discuss the classical histogram, 
the more recently introduced kernel density estimates and the empirical dis- 
tribution function. To represent a bivariate dataset, which consists of repeated 
measurements of two quantities, we use the scatterplot. 


15.1 Example: the Old Faithful data 


The Old Faithful geyser at Yellowstone National Park, Wyoming, USA, was 
observed from August 1st to August 15th, 1985. During that time, data were 
collected on the duration of eruptions. There were 272 eruptions observed, of 
which the recorded durations are listed in Table 15.1. The data are given in 
seconds. 


The variety in the lengths of the eruptions indicates that randomness is in- 
volved. By exploring the dataset we might learn about this randomness. For 
instance: we like to know which durations are more likely to occur than others; 
is there something like “the typical duration of an eruption”; do the durations 
vary symmetrically around the center of the dataset; and so on. In order to 
retrieve this type of information, just listing the observed durations does not 
help us very much. Somehow we must summarize the observed data. We could 




Table 15.1. Duration in seconds of 272 eruptions of the Old Faithful geyser. 


216 108 200 137 272 173 282 216 117 261 
110 235 252 105 282 130 105 288 96 255 
108 105 207 184 272 216 118 245 231 266 
258 268 202 242 230 121 112 290 110 287 
261 113 274 105 272 199 230 126 278 120 
288 283 110 290 104 293 223 100 274 259 
134 270 105 288 109 264 250 282 124 282 
242 118 270 240 119 304 121 274 233 216 
248 260 246 158 244 296 237 271 130 240 
132 260 112 289 110 258 280 225 112 294 
149 262 126 270 243 112 282 107 291 221 
284 138 294 265 102 278 139 276 109 265 
157 244 255 118 276 226 115 270 136 279 
112 250 168 260 110 263 113 296 122 224 
254 134 272 289 260 119 278 121 306 108 
302 240 144 276 214 240 270 245 108 238 
132 249 120 230 210 275 142 300 116 277 
115 125 275 200 250 260 270 145 240 250 
113 275 255 226 122 266 245 110 265 131 
288 110 288 246 238 254 210 262 135 280 
126 261 248 112 276 107 262 231 116 270 
143 282 112 230 205 254 144 288 120 249 
112 256 105 269 240 247 245 256 235 273 
245 145 251 133 267 113 111 257 237 140 
249 141 296 174 275 230 125 262 128 261 
132 267 214 270 249 229 235 267 120 257 
286 272 111 255 119 135 285 247 129 265 
109 268 


Source: W. Härdle. Smoothing techniques with implementation in S. 1991; 
Table 3, page 201. © Springer New York. 


start by computing the mean of the data, which is 209.3 for the Old Faithful 
data. However, this is a poor summary of the dataset, because there is a lot 
more information in the observed durations. How do we get hold of this? 


Just staring at the dataset for a while tells us very little. To see something, 
we have to rearrange the data somehow. The first thing we could do is order 
the data. The result is shown in Table 15.2. Putting the elements in order 
already provides more information. For instance, it is now immediately clear 
that all elements lie between 96 and 306. 


QUICK EXERCISE 15.1 Which two elements of the Old Faithful dataset split 
the dataset in three groups of equal size? 


A closer look at the ordered data shows that the two middle elements (the 
136th and 137th elements in ascending order) are equal to 240, which is much 
closer to the maximum value 306 than to the minimum value 96. This seems to 




Table 15.2. Ordered durations of eruptions of the Old Faithful geyser. 


96 100 102 104 105 105 105 105 105 105 
107 107 108 108 108 108 109 109 109 110 
110 110 110 110 110 110 111 111 112 112 
112 112 112 112 112 112 113 113 113 113 
115 115 116 116 117 118 118 118 119 119 
119 120 120 120 120 121 121 121 122 122 
124 125 125 126 126 126 128 129 130 130 
131 132 132 132 133 134 134 135 135 136 
137 138 139 140 141 142 143 144 144 145 
145 149 157 158 168 173 174 184 199 200 
200 202 205 207 210 210 214 214 216 216 
216 216 221 223 224 225 226 226 229 230 
230 230 230 230 231 231 233 235 235 235 
237 237 238 238 240 240 240 240 240 240 
242 242 243 244 244 245 245 245 245 245 
246 246 247 247 248 248 249 249 249 249 
250 250 250 250 251 252 254 254 254 255 
255 255 255 256 256 257 257 258 258 259 
260 260 260 260 260 261 261 261 261 262 
262 262 262 263 264 265 265 265 265 266 
266 267 267 267 268 268 269 270 270 270 
270 270 270 270 270 271 272 272 272 272 
272 273 274 274 274 275 275 275 275 276 
276 276 276 277 278 278 278 279 280 280 
282 282 282 282 282 282 283 284 285 286 
287 288 288 288 288 288 288 289 289 290 
290 291 293 294 294 296 296 296 300 302 
304 306 


indicate that the dataset is somewhat asymmetric, but even from the ordered 
dataset we cannot get a clear picture of this asymmetry. Also, geologists be- 
lieve that there are two different kinds of eruptions that play a role. Hence one 
would expect two separate values around which the elements of the dataset 
would accumulate, corresponding to the typical durations of the two types of 
eruptions. Again it is not clear, not even from the ordered dataset, what these 
two typical values are. It would be better to have a plot of the dataset that 
reflects symmetry or asymmetry of the data and from which we can easily see 
where the elements accumulate. In the following sections we will discuss two 
such methods. 


15.2 Histograms 


The classical method to graphically represent data is the histogram, which
probably dates from the mortality studies of John Graunt in 1662 (see
Westergaard [39], p.22). The term histogram appears to have been used first by
Karl Pearson ([22]). Figure 15.1 displays a histogram of the Old Faithful data.
The picture immediately reveals the asymmetry of the dataset and the fact
that the elements accumulate somewhere near 120 and 270, which was not
clear from Tables 15.1 and 15.2.


Fig. 15.1. Histogram of the Old Faithful data.


The construction of the histogram is as follows. Let us denote a generic
(univariate) dataset of size $n$ by

\[ x_1, x_2, \ldots, x_n \]

and suppose we want to construct a histogram. We use the version of the
histogram that is scaled in such a way that the total area under the curve is
equal to one.¹


First we divide the range of the data into intervals. These intervals are called
bins and are denoted by

\[ B_1, B_2, \ldots, B_m. \]

The length of an interval $B_i$ is denoted by $|B_i|$ and is called the bin width.
The bins do not necessarily have the same width. In Figure 15.1 we have eight
bins of equal bin width. We want the area under the histogram on each bin
$B_i$ to reflect the number of elements in $B_i$. Since the total area 1 under the
histogram then corresponds to the total number of elements $n$ in the dataset,
the area under the histogram on a bin $B_i$ is equal to the proportion of elements
in $B_i$:

\[ \frac{\text{the number of } x_j \text{ in } B_i}{n}. \]

¹ The reason to scale the histogram so that the total area under the curve is equal to
one is that if we view the data as being generated from some unknown probability
density $f$ (see Chapter 17), such a histogram can be used as a crude estimate of $f$.




The height of the histogram on bin $B_i$ must then be equal to

\[ \frac{\text{the number of } x_j \text{ in } B_i}{n\,|B_i|}. \]
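The height formula can be sketched in a few lines of code. The example below is illustrative only: the function name `histogram_heights`, the small dataset, and the bin edges are assumptions made for the sake of the sketch and are not taken from the Old Faithful example. Bins are half-open intervals of the form (left, right].

```python
def histogram_heights(data, edges):
    """Heights of an area-one histogram with bins (edges[i], edges[i+1]]."""
    n = len(data)
    heights = []
    for left, right in zip(edges[:-1], edges[1:]):
        # Count the elements falling in the bin (left, right].
        count = sum(1 for x in data if left < x <= right)
        # Height = proportion of elements in the bin divided by the bin width.
        heights.append(count / (n * (right - left)))
    return heights

data = [4, 3, 9, 1, 7]      # small illustrative dataset (assumption)
edges = [0, 5, 10]          # two bins: (0, 5] and (5, 10]

heights = histogram_heights(data, edges)
# Total area under the histogram: sum of height times bin width.
area = sum(h * (r - l) for h, l, r in zip(heights, edges[:-1], edges[1:]))

print("heights:", heights)
print("total area:", area)
```

Dividing by the bin width as well as by $n$ is what makes the total area equal to one, also when the bins have unequal widths.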


QUICK EXERCISE 15.2 Use Table 15.2 to count how many elements fall into 
each of the bins (90, 120], (120, 150], ..., (300, 330] in Figure 15.1 and com- 
pute the height on each bin. 


Choice of the bin width 


Consider a histogram with bins of equal width. In that case the bins are of
the form

\[ B_i = (r + (i - 1)b,\; r + ib] \quad \text{for } i = 1, 2, \ldots, m, \]

where $r$ is some reference point smaller than the minimum of the dataset,
and $b$ denotes the bin width. In Figure 15.2, three histograms of the Old
Faithful data of Table 15.2 are displayed with bin widths equal to 2, 30, and
90, respectively. Clearly, the choice of the bin width $b$, or the corresponding
choice of the number of bins $m$, will determine what the resulting histogram
will look like. Choosing the bin width too small will result in a chaotic figure 
with many isolated peaks. Choosing the bin width too large will result in a 
figure without much detail, at the risk of losing information about general 
characteristics. In Figure 15.2, bin width b = 2 is somewhat too small. Bin 
width b = 90 is clearly too large and produces a histogram that no longer 
captures the fact that the data show two separate modes near 120 and 270. 


How does one go about choosing the bin width? In practice, this might boil
down to picking the bin width by trial and error, continuing until the figure
looks reasonable. Mathematical research, however, has provided some
guidelines for a data-based choice for $b$ or $m$. Formulas that may effectively be
used are $m = 1 + 3.3 \log_{10}(n)$ (see [34]) or $b = 3.49\, s\, n^{-1/3}$ (see [29]; see also
Remark 15.1), where $s$ is the sample standard deviation (see Section 16.2 for
the definition of the sample standard deviation).


Fig. 15.2. Histograms of the Old Faithful data with different bin widths
(panels: bin width 2, bin width 30, bin width 90).




Remark 15.1 (Normal reference method for histograms). Let
$H_n(x)$ denote the height of the histogram at $x$ and suppose that we view our
dataset as being generated from a probability distribution with density $f$.
We would like to find the bin width that minimizes the difference between
$H_n$ and $f$, measured by the so-called mean integrated squared error (MISE)

\[ E\left[ \int_{-\infty}^{\infty} \left( H_n(x) - f(x) \right)^2 dx \right]. \]

Under suitable smoothness conditions on $f$, the value of $b$ that minimizes
the MISE as $n$ goes to infinity is given by

\[ b = C(f)\, n^{-1/3} \quad \text{where} \quad C(f) = 6^{1/3} \left( \int_{-\infty}^{\infty} f'(x)^2\, dx \right)^{-1/3} \]

(see for instance [29] or [12]). A simple data-based choice for $b$ is obtained by
estimating the constant $C(f)$. The normal reference method takes $f$ to be
the density of an $N(\mu, \sigma^2)$ distribution, in which case $C(f) = (24\sqrt{\pi})^{1/3} \sigma$.
Estimating $\sigma$ by the sample standard deviation $s$ (see Chapter 16 for a
definition of $s$) would result in bin width

\[ b = (24\sqrt{\pi})^{1/3}\, s\, n^{-1/3}. \]

For the Old Faithful data this would give $b = 36.89$.
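The two rules of thumb for histograms can be evaluated directly. This is a small sketch under stated assumptions: it uses $n = 272$ and the sample standard deviation $s = 68.48$ that the text quotes for the Old Faithful data in Quick exercise 15.3; the variable names are ours.

```python
import math

n = 272        # number of eruptions in the Old Faithful dataset
s = 68.48      # sample standard deviation quoted in Quick exercise 15.3

# Rule for the number of bins: m = 1 + 3.3 log10(n).
m_rule = 1 + 3.3 * math.log10(n)

# Bin width with the rounded constant 3.49 ...
b_rounded = 3.49 * s * n ** (-1 / 3)

# ... and with the normal reference constant (24 sqrt(pi))^(1/3).
constant = (24 * math.sqrt(math.pi)) ** (1 / 3)
b_exact = constant * s * n ** (-1 / 3)

print(f"suggested number of bins m: {m_rule:.2f}")
print(f"normal reference constant: {constant:.4f}")
print(f"bin width, rounded constant: {b_rounded:.2f}")
print(f"bin width, exact constant:   {b_exact:.2f}")
```

The constant $(24\sqrt{\pi})^{1/3}$ evaluates to about 3.49, which is where the rounded formula $b = 3.49\, s\, n^{-1/3}$ comes from.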


QUICK EXERCISE 15.3 If we construct a histogram for the Old Faithful data
with equal bin width $b = 3.49\, s\, n^{-1/3}$, how many bins will we need to cover the
data if $s = 68.48$?


The main advantage of the histogram is that it is simple. Its disadvantage is 
the discrete character of the plot. In Figure 15.1 it is still somewhat unclear 
which two values correspond to the typical durations of the two types of 
eruptions. Another well-known artifact is that changing the bin width slightly 
or keeping the bin width fixed and shifting the bins slightly may result in a 
figure of a different nature. A method that produces a smoother figure and is 
less sensitive to these kinds of changes will be discussed in the next section. 


15.3 Kernel density estimates 


We can graphically represent data in a more variegated plot by a so-called 
kernel density estimate. The basic ideas of kernel density estimation first ap- 
peared in the early 1950s. Rosenblatt [25] and Parzen [21] provided the stim- 
ulus for further research on this topic. Although the method was introduced 
in the middle of the last century, until recently it remained unpopular as a 
tool for practitioners because of its computationally intensive nature. 


Figure 15.3 displays a kernel density estimate of the Old Faithful data. Again
the picture immediately reveals the asymmetry of the dataset, but it is much
smoother than the histogram in Figure 15.1. Note that it is now easier to
detect the two typical values around which the elements accumulate.

Fig. 15.3. Kernel density estimate of the Old Faithful data.


The idea behind the construction of the plot is to “put a pile of sand” around
each element of the dataset. At places where the elements accumulate, the
sand will pile up. The actual plot is constructed by choosing a kernel $K$ and
a bandwidth $h$. The kernel $K$ reflects the shape of the piles of sand, whereas
the bandwidth is a tuning parameter that determines how wide the piles
of sand will be. Formally, a kernel $K$ is a function $K : \mathbb{R} \to \mathbb{R}$. Figure 15.4
displays several well-known kernels. A kernel $K$ typically satisfies the following
conditions:

(K1) $K$ is a probability density, i.e., $K(u) \ge 0$ and $\int_{-\infty}^{\infty} K(u)\, du = 1$;
(K2) $K$ is symmetric around zero, i.e., $K(u) = K(-u)$;
(K3) $K(u) = 0$ for $|u| > 1$.


Examples are the Epanechnikov kernel:

\[ K(u) = \frac{3}{4} (1 - u^2) \quad \text{for } -1 \le u \le 1 \]

and $K(u) = 0$ elsewhere, and the triweight kernel

\[ K(u) = \frac{35}{32} (1 - u^2)^3 \quad \text{for } -1 \le u \le 1 \]

and $K(u) = 0$ elsewhere. Sometimes one uses kernels that do not satisfy
condition (K3), for example, the normal kernel

\[ K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} u^2} \quad \text{for } -\infty < u < \infty. \]
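As a numerical check of conditions (K1) and (K2), the three kernels named above can be coded and integrated with a simple midpoint rule. This is only a sketch: the helper `integrate`, the integration range [-8, 8], and the number of steps are our own choices, and the normal kernel is integrated over a finite range on the assumption that its tails beyond 8 are negligible.

```python
import math

def epanechnikov(u: float) -> float:
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def triweight(u: float) -> float:
    return (35.0 / 32.0) * (1.0 - u * u) ** 3 if abs(u) <= 1.0 else 0.0

def normal(u: float) -> float:
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def integrate(f, a: float, b: float, steps: int = 16_000) -> float:
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return width * sum(f(a + (i + 0.5) * width) for i in range(steps))

kernels = {"Epanechnikov": epanechnikov, "triweight": triweight, "normal": normal}

# (K1): the area under each kernel should be (numerically) one.
areas = {name: integrate(k, -8.0, 8.0) for name, k in kernels.items()}
# (K2): each kernel should be symmetric around zero.
symmetric = {name: all(k(u) == k(-u) for u in (0.1, 0.5, 0.9, 1.5))
             for name, k in kernels.items()}

for name in kernels:
    print(f"{name}: area = {areas[name]:.5f}, symmetric = {symmetric[name]}")
```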
Fig. 15.4. Examples of well-known kernels K (panels: triangular, cosine,
Epanechnikov, biweight, triweight, and normal kernel).

Let us denote a kernel density estimate by $f_{n,h}$, and suppose that we want to
construct $f_{n,h}$ for a dataset $x_1, x_2, \ldots, x_n$. In Figure 15.5 the construction is
illustrated for a dataset containing five elements, where we use the
Epanechnikov kernel and bandwidth $h = 0.5$. First we scale the kernel $K$ (solid line)
into the function

\[ t \mapsto \frac{1}{h} K\left( \frac{t}{h} \right). \]

The scaled kernel (dotted line) is of the same type as the original kernel, with
area 1 under the curve but is positive on the interval $[-h, h]$ instead of $[-1, 1]$
and higher (lower) when $h$ is smaller (larger) than 1. Next, we put a scaled
kernel around each element $x_i$ in the dataset. This results in functions of the
type

\[ t \mapsto \frac{1}{h} K\left( \frac{t - x_i}{h} \right). \]

These shifted kernels (dotted lines) have the same shape as the transformed
kernel, all with area 1 under the curve, but they are now symmetric around
$x_i$ and positive on the interval $[x_i - h, x_i + h]$. We see that the graphs of the
shifted kernels will overlap whenever $x_i$ and $x_j$ are close to each other, so
that things will pile up more at places where more elements accumulate. The
kernel density estimate $f_{n,h}$ is constructed by summing the scaled kernels and
dividing them by $n$, in order to obtain area 1 under the curve:


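\[
f_{n,h}(t) = \frac{1}{n} \left\{ \frac{1}{h} K\left( \frac{t - x_1}{h} \right) + \frac{1}{h} K\left( \frac{t - x_2}{h} \right) + \cdots + \frac{1}{h} K\left( \frac{t - x_n}{h} \right) \right\}
\]

or briefly,

\[
f_{n,h}(t) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{t - x_i}{h} \right). \qquad (15.1)
\]

Fig. 15.5. Construction of a kernel density estimate $f_{n,h}$ (panels: kernel and
scaled kernel, shifted kernel, kernel density estimate).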


When computing $f_{n,h}(t)$, we assign higher weights to observations $x_i$ closer to
$t$, in contrast to the histogram where we simply count the number of
observations in the bin that contains $t$. Note that as a consequence of condition (K1),
$f_{n,h}$ itself is a probability density:

\[ f_{n,h}(t) \ge 0 \quad \text{and} \quad \int_{-\infty}^{\infty} f_{n,h}(t)\, dt = 1. \]
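Formula (15.1) translates almost literally into code. The sketch below is illustrative and rests on assumptions: the five-element dataset and the bandwidth $h = 1.5$ are made up (the dataset behind Figure 15.5 is not listed in the text), the Epanechnikov kernel is used, and the area under the estimate is checked with a simple midpoint rule.

```python
def epanechnikov(u: float) -> float:
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kde(t: float, data, h: float, kernel=epanechnikov) -> float:
    """Kernel density estimate at t, following formula (15.1)."""
    n = len(data)
    return sum(kernel((t - x) / h) for x in data) / (n * h)

data = [4, 3, 9, 1, 7]   # hypothetical dataset, for illustration only
h = 1.5                  # hypothetical bandwidth

# The estimate vanishes outside [min - h, max + h], so integrate there.
lo, hi = min(data) - h, max(data) + h
steps = 10_000
width = (hi - lo) / steps
area = width * sum(kde(lo + (i + 0.5) * width, data, h) for i in range(steps))

print(f"estimate at t = 3.5: {kde(3.5, data, h):.4f}")
print(f"area under the estimate: {area:.5f}")
```

Each evaluation of `kde` loops over all $n$ observations, which hints at why the method is computationally more demanding than a histogram.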


QUICK EXERCISE 15.4 Check that the total area under the kernel density
estimate is equal to one, i.e., show that $\int_{-\infty}^{\infty} f_{n,h}(t)\, dt = 1$.


Note that computing $f_{n,h}$ is very computationally intensive. Its common use
nowadays is therefore a typical product of the recent developments in
computer hardware, despite the fact that the method was introduced much earlier.


Choice of the bandwidth 


The bandwidth h plays the same role for kernel density estimates as the bin 
width b does for histograms. In Figure 15.6 three kernel density estimates of 
the Old Faithful data are plotted with the triweight kernel and bandwidths 
1.8, 18, and 180. It is clear that the choice of the bandwidth h determines 
largely what the resulting kernel density estimate will look like. Choosing the 
bandwidth too small will produce a curve with many isolated peaks. Choosing 
the bandwidth too large will produce a very smooth curve, at the risk of 
smoothing away important features of the data. In Figure 15.6 bandwidth
$h = 1.8$ is somewhat too small. Bandwidth $h = 180$ is clearly too large and
produces an oversmoothed kernel density estimate that no longer captures the
fact that the data show two separate modes.


Fig. 15.6. Kernel estimates of the Old Faithful data (panels: bandwidth 1.8,
bandwidth 18, bandwidth 180).


How does one go about choosing the bandwidth? Similar to histograms, in
practice one could do this by trial and error and continue until one obtains
a reasonable picture. Recent research, however, has provided some guidelines
for a data-based choice of $h$. A formula that may effectively be used is
$h = 1.06\, s\, n^{-1/5}$, where $s$ denotes the sample standard deviation (see, for instance,
[31]; see also Remark 15.2).


Remark 15.2 (Normal reference method for kernel estimates).
Suppose we view our dataset as being generated from a probability
distribution with density $f$. Let $K$ be a fixed chosen kernel and let $f_{n,h}$ be
the kernel density estimate. We would like to take the bandwidth that
minimizes the difference between $f_{n,h}$ and $f$, measured by the so-called mean
integrated squared error (MISE)

\[ E\left[ \int_{-\infty}^{\infty} \left( f_{n,h}(x) - f(x) \right)^2 dx \right]. \]

Under suitable smoothness conditions on $f$, the value of $h$ that minimizes
the MISE, as $n$ goes to infinity, is given by

\[ h = C_1(f)\, C_2(K)\, n^{-1/5}, \]

where the constants $C_1(f)$ and $C_2(K)$ are given by

\[
C_1(f) = \left( \frac{1}{\int_{-\infty}^{\infty} f''(x)^2\, dx} \right)^{1/5}
\quad \text{and} \quad
C_2(K) = \frac{\left( \int_{-\infty}^{\infty} K(u)^2\, du \right)^{1/5}}{\left( \int_{-\infty}^{\infty} u^2 K(u)\, du \right)^{2/5}}.
\]

After choosing the kernel $K$, one can compute the constant $C_2(K)$ to obtain
a simple data-based choice for $h$ by estimating the constant $C_1(f)$. For
instance, for the normal kernel one finds $C_2(K) = (2\sqrt{\pi})^{-1/5}$. As with
histograms (see Remark 15.1), the normal reference method takes $f$ to be
the density of an $N(\mu, \sigma^2)$ distribution, in which case $C_1(f) = (8\sqrt{\pi}/3)^{1/5} \sigma$.
Estimating $\sigma$ by the sample standard deviation $s$ (see Chapter 16 for a
definition of $s$) would result in bandwidth

\[ h = \left( \frac{4}{3} \right)^{1/5} s\, n^{-1/5}. \]

For the Old Faithful data, this would give $h = 23.64$.
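The bandwidth from the normal reference method can be evaluated in the same way as the bin width. This sketch assumes $n = 272$ and $s = 68.48$, the values the text quotes for the Old Faithful data; variable names are ours. It also shows the effect of rounding the constant $(4/3)^{1/5}$ to 1.06.

```python
import math

n = 272        # number of eruptions in the Old Faithful dataset
s = 68.48      # sample standard deviation quoted in the quick exercises

# Constants for the normal kernel and a normal reference density (sigma = 1).
c2_normal_kernel = (2 * math.sqrt(math.pi)) ** (-1 / 5)
c1_normal_density = (8 * math.sqrt(math.pi) / 3) ** (1 / 5)

# Their product is (4/3)^(1/5), which is approximately 1.06.
constant = c1_normal_density * c2_normal_kernel

h_exact = constant * s * n ** (-1 / 5)
h_rounded = 1.06 * s * n ** (-1 / 5)

print(f"C1 * C2 = {constant:.4f}")
print(f"bandwidth, exact constant:   {h_exact:.2f}")
print(f"bandwidth, rounded constant: {h_rounded:.2f}")
```

The two bandwidths differ only in the second decimal, so the rounded formula is adequate for plotting purposes.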


QUICK EXERCISE 15.5 If we construct a kernel density estimate for the Old
Faithful data with bandwidth $h = 1.06\, s\, n^{-1/5}$, then on what interval is $f_{n,h}$
strictly positive if $s = 68.48$?


Choice of the kernel 


To construct a kernel density estimate, one has to choose a kernel $K$ and a
bandwidth h. The choice of kernel is less important. In Figure 15.7 we have 
plotted two kernel density estimates for the Old Faithful data of Table 15.1: 
one is constructed with the triweight kernel (solid line), and one with the 
Epanechnikov kernel (dotted line), both with the same bandwidth h = 24. As 
one can see, the graphs are very similar. If one wants to compare with the 
normal kernel, one should set the bandwidth of the normal kernel at about 
h/4. This has to do with the fact that the normal kernel is much more spread 
out than the two kernels mentioned here, which are zero outside $[-1, 1]$.


Fig. 15.7. Kernel estimates of the Old Faithful data with different kernels: triweight
(solid line) and Epanechnikov kernel (dotted), both with bandwidth h = 24.


Boundary kernels 


In order to estimate the parameters of a software reliability model, failure data 
are collected. Usually the most desirable type of failure data results when the 


Table 15.3. Interfailure times between successive failures. 


  30  113   81  115    9    2   91  112   15  138
  50   77   24  108   88  670  120   26  114  325
  55  242   68  422  180   10 1146  600   15   36
   4    0    8  227   65  176   58  457  300   97
 263  452  255  197  193    6   79  816 1351  148
  21  233  134  357  193  236   31  369  748    0
 232  330  365 1222  543   10   16  529  379   44
 129  810  290  300  529  281  160  828 1011  445
 296 1755 1064 1783  860  983  707   33  868  724
2323 2930 1461  843   12  261 1800  865 1435   30
 143  108    0 3110 1247  943  700  875  245  729
1897  447  386  446  122  990  948 1082   22   75
 482 5509  100   10 1071  371  790 6150 3321 1045
 648 5485 1160 1864 4116


Source: J.D. Musa, A. Iannino, and K. Okumoto. Software reliability: mea- 
surement, prediction, application. McGraw-Hill, New York, 1987; Table on 
page 305. 


failure times are recorded, or equivalently, the length of an interval between 
successive failures. The data in Table 15.3 are observed interfailure times in 
CPU seconds for a certain control software system. On the left in Figure 15.8 
a kernel density estimate of the observed interfailure times is plotted. Note
that to the left of the origin, $f_{n,h}$ is positive. This is absurd, since it suggests
that there are negative interfailure times.

This phenomenon is a consequence of the fact that one uses a symmetric
kernel. In that case, the resulting kernel density estimate will always be positive
on the interval $[x_i - h, x_i + h]$ for every element $x_i$ in the dataset. Hence,
observations close to zero will cause the kernel density estimate $f_{n,h}$ to be positive
to the left of zero. It is possible to improve the kernel density estimate in a
neighborhood of zero by means of a so-called boundary kernel. Without going
into detail about the construction of such an improvement, we will only show
the result of this. On the right in Figure 15.8 the histogram of the interfailure
times is plotted together with the kernel density estimate constructed with a
symmetric kernel (dotted line) and with the boundary kernel density estimate
(solid line). The boundary kernel density estimate is 0 to the left of the
origin and is adjusted on the interval $[0, h)$. On the interval $[h, \infty)$ both kernel
density estimates are the same.

Fig. 15.8. Kernel density estimate of the software reliability data with symmetric
and boundary kernel.


15.4 The empirical distribution function 


Another way to graphically represent a dataset is to plot the data in a
cumulative manner. This can be done using the empirical cumulative distribution
function of the data. It is denoted by $F_n$ and is defined at a point $x$ as the
proportion of elements in the dataset that are less than or equal to $x$:

\[ F_n(x) = \frac{\text{number of elements in the dataset} \le x}{n}. \]
To illustrate the construction of $F_n$, consider the dataset consisting of the
elements

4  3  9  1  7.

The corresponding empirical distribution function is displayed in Figure 15.9.
For $x < 1$, there are no elements less than or equal to $x$, so that $F_n(x) = 0$. For
$1 \le x < 3$, only the element 1 is less than or equal to $x$, so that $F_n(x) = 1/5$.
For $3 \le x < 4$, the elements 1 and 3 are less than or equal to $x$, so that
$F_n(x) = 2/5$, and so on.

In general, the graph of $F_n$ has the form of a staircase, with $F_n(x) = 0$ for all
$x$ smaller than the minimum of the dataset and $F_n(x) = 1$ for all $x$ greater
than the maximum of the dataset. Between the minimum and maximum, $F_n$
has a jump of size $1/n$ at each element of the dataset and is constant between
successive elements. In Figure 15.9, the marks • and ∘ are added to the graph
to emphasize the fact that, for instance, the value of $F_n(x)$ at $x = 3$ is 0.4, not
0.2. Usually, we leave these out, and one might also connect the horizontal
segments by vertical lines.
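The definition of the empirical distribution function can be sketched in code for the five-element dataset used above. This is a minimal illustration: the function name `make_ecdf` is our own, and the count of elements less than or equal to $x$ is obtained with `bisect_right` from Python's standard library.

```python
import bisect

def make_ecdf(data):
    """Return the empirical distribution function F_n of the dataset."""
    ordered = sorted(data)
    n = len(ordered)

    def F(x: float) -> float:
        # bisect_right gives the number of ordered elements that are <= x.
        return bisect.bisect_right(ordered, x) / n

    return F

F = make_ecdf([4, 3, 9, 1, 7])

for x in (0.5, 1, 2.99, 3, 3.5, 9, 12):
    print(f"F_n({x}) = {F(x)}")
```

Evaluating at $x = 3$ and just below it shows the jump of size $1/n$ at an element of the dataset: the function is right-continuous, so the value at the jump point is the higher one.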


In Figure 15.10 the empirical distribution functions are plotted for the Old 
Faithful data and the software reliability data. The fact that the Old Faithful 
data accumulate in the neighborhood of 120 and 270 is reflected in the graph 
of $F_n$ by the fact that it is steeper at these places: the jumps of $F_n$ succeed each
other faster. In regions where the elements of the dataset are more stretched
out, the graph of $F_n$ is flatter. Similar behavior can be seen for the software
reliability data in the neighborhood of zero. The elements accumulate close to
zero and become sparser as we move to the right. This is reflected by the empirical
distribution function, which is very steep near zero and flattens out if we move
to the right.


Fig. 15.9. Empirical distribution function.


The graph of the empirical distribution function for the Old Faithful data 
agrees with the histogram in Figure 15.1 whose height is the largest on the 
bins (90, 120] and (240, 270]. In fact, there is a one-to-one relation between the 
two graphical summaries of the data: the area under the histogram on a single 
bin is equal to the relative frequency of elements that lie in that bin, which is 
also equal to the increase of $F_n$ on that bin. For instance, the area under the
histogram on bin (240, 270] for the Old Faithful data is equal to $30 \cdot 0.0092 = 0.276$
(see Quick exercise 15.2). On the other hand, $F_n(270) = 215/272 = 0.7904$
and $F_n(240) = 140/272 = 0.5147$, whose difference $F_n(270) - F_n(240)$
is also equal to 0.276.


Fig. 15.10. Empirical distribution function of the Old Faithful data (left) and the software reliability data (right).
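The relation between the area under the histogram on a bin and the increase of $F_n$ on that bin can be verified numerically. The sketch below uses the five-element dataset 4 3 9 1 7 and two bins of our own choosing, (0, 5] and (5, 10]; the helper names are hypothetical.

```python
def ecdf_value(data, x):
    # Proportion of elements less than or equal to x.
    return sum(1 for v in data if v <= x) / len(data)


def histogram_height(data, a, b):
    # Height on the bin (a, b]: count / (n * bin width).
    count = sum(1 for v in data if a < v <= b)
    return count / (len(data) * (b - a))


data = [4, 3, 9, 1, 7]
bins = [(0, 5), (5, 10)]  # illustrative bins, not taken from the text

areas = [histogram_height(data, a, b) * (b - a) for a, b in bins]
increases = [ecdf_value(data, b) - ecdf_value(data, a) for a, b in bins]
print(areas, increases)

# The Old Faithful figures quoted in the text: F_n(270) - F_n(240).
old_faithful_increase = 215 / 272 - 140 / 272
print(round(old_faithful_increase, 3))
```

For both bins the area and the increase of $F_n$ coincide, as they do for the bin (240, 270] of the Old Faithful data.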


QUICK EXERCISE 15.6 Suppose that for a dataset consisting of 300 elements, 
the value of the empirical distribution function in the point 1.5 is equal to 
0.7. How many elements in the dataset are strictly greater than 1.5? 


Remark 15.3 ($F_n$ as a discrete distribution function). Note that
$F_n$ satisfies the four properties of a distribution function: it is continuous
from the right, $F_n(x) \to 0$ as $x \to -\infty$, $F_n(x) \to 1$ as $x \to \infty$, and $F_n$ is
nondecreasing. This means that $F_n$ itself is a distribution function of some
random variable. Indeed, $F_n$ is the distribution function of the discrete random
variable that attains values $x_1, x_2, \ldots, x_n$ with equal probability $1/n$.


15.5 Scatterplot 


In some situations one wants to investigate the relationship between two or 
more variables. In the case of two variables x and y, the dataset consists of 
pairs of observations: 


\[ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n). \]


We call such a dataset a bivariate dataset in contrast to the univariate dataset, 
which consists of observations of one particular quantity. We often like to investigate
whether the value of variable $y$ depends on the value of the variable $x$,
and if so, whether we can describe the relation between the two variables. A
first step is to take a look at the data, i.e., to plot the points $(x_i, y_i)$ for
$i = 1, 2, \ldots, n$. Such a plot is called a scatterplot.


Drilling in rock 


During a study about “dry” and “wet” drilling in rock, six holes were drilled, 
three corresponding to each process. In a dry hole one forces compressed air 
down the drill rods to flush the cutting and the drive hammer, whereas in a 
wet hole one forces water. As the hole gets deeper, one has to add a rod of 
5 feet length to the drill. In each hole the time was recorded to advance 5 
feet to a total depth of 400 feet. The data in Table 15.4 are in 1/100 minute 
and are derived from the original data in [23]. The original data consisted of 
drill times for each of the six holes and contained missing observations and 
observations that were known to be too large. The data in Table 15.4 are the 
mean drill times of the bona fide observations at each depth for dry and wet 
drilling. 

One of the questions of interest is whether drill time depends on depth. To in- 
vestigate this, we plot the mean drill time against depth. Figure 15.11 displays 




Table 15.4. Mean drill time. 


Depth      Dry      Wet     Depth      Dry      Wet

    5   640.67   830.00       205   803.33   962.33
   10   674.67   800.00       210   794.33   864.67
   15   708.00   711.33       215   760.67   805.67
   20   735.67   867.67       220   789.50   966.00
   25   704.33   940.67       225   904.50  1010.33
   30   723.83   941.33       230   940.50   936.33
   35   664.33   924.33       235   882.00   915.67
   40   727.67   873.00       240   783.50   956.33
   45   658.67   874.67       245   843.50   936.00
   50   658.00   843.33       250   813.50   803.67
   55   705.67   885.67       255   658.00   697.33
   60   700.00   881.67       260   702.50   795.67
   65   720.67   822.00       265   623.50  1045.33
   70   701.33   886.33       270   739.00  1029.67
   75   716.67   842.50       275   907.50   977.00
   80   649.67   874.67       280   846.00  1054.33
   85   667.33   889.33       285   829.00  1001.33
   90   612.67   870.67       290   975.50  1042.00
   95   656.67   916.00       295   998.00  1200.67
  100   614.00   888.33       300  1037.50  1172.67
  105   584.00   835.33       305   984.00  1019.67
  110   619.67   776.33       310   972.50   990.33
  115   666.00   811.67       315   834.00  1173.33
  120   695.00   874.67       320   675.00  1165.67
  125   702.00   846.00       325   686.00  1142.00
  130   739.67   920.67       330   963.00  1030.67
  135   790.67   896.33       335   961.50  1089.67
  140   730.33   810.33       340   932.00  1154.33
  145   674.00   912.33       345  1054.00  1238.50
  150   749.00   862.33       350  1038.00  1208.67
  155   709.67   828.00       355  1238.00  1134.67
  160   769.00   812.67       360   927.00  1088.00
  165   663.00   795.67       365   850.00  1004.00
  170   679.33   897.67       370  1066.00  1104.00
  175   740.67   881.00       375   962.50   970.33
  180   776.50   819.67       380  1025.50  1054.50
  185   688.00   853.33       385  1205.50  1143.50
  190   761.67   844.33       390  1168.00  1044.00
  195   800.00   919.00       395  1032.50   978.33
  200   845.50   933.33       400  1162.00  1104.00


Source: R. Penner and D.G. Watts. Mining information. The American 
Statistician, 45:4-9, 1991; Table 1 on page 6. 


Fig. 15.11. Scatterplots of mean drill time versus depth (left: dry holes; right: wet holes).


the resulting scatterplots for the dry and wet holes. The scatterplots seem to 
indicate that in the beginning the drill time hardly depends on depth, at least 
up to, let’s say, 250 feet. At greater depth, the drill time seems to vary over a 
larger range and increases somewhat with depth. A possible explanation for 
this is that the drill moved from softer to harder material. This was suggested 
by the fact that the drill hit an ore lens at about 250 feet and that the natural 
place such ore lenses occur is between two different materials (see [23] for 
details). 

A more important question is whether one can drill holes faster using dry 
drilling or wet drilling. The scatterplots seem to suggest that dry drilling 
might be faster. We will come back to this later. 


Predicting Janka hardness of Australian timber 


The Janka hardness test is a standard test to measure the hardness of wood. 
It measures the force required to push a steel ball with a diameter of 11.28 
millimeters (0.444 inch) into the wood to a depth of half the ball’s diameter. 
To measure Janka hardness directly is difficult. However, it is related to the 
density of the wood, which is comparatively easy to measure. In Table 15.5 
a bivariate dataset is given of density (x) and Janka hardness (y) of 36 Aus- 
tralian eucalypt hardwoods. 


In order to get an impression of the relationship between hardness and den- 
sity, we made a scatterplot of the bivariate dataset, which is displayed in 
Figure 15.12. It consists of all points $(x_i, y_i)$ for $i = 1, 2, \ldots, 36$. The scatterplot
might provide suggestions for the formula that describes the relationship
between the variables x and y. In this case, a linear relationship between the 
two variables does not seem unreasonable. Later (Chapter 22) we will discuss 




Table 15.5. Density and hardness of Australian timber. 


Density  Hardness     Density  Hardness     Density  Hardness

   24.7       484        39.4      1210        53.4      1880
   24.8       427        39.9       989        56.0      1980
   27.3       413        40.3      1160        56.5      1820
   28.4       517        40.6      1010        57.3      2020
   28.4       549        40.7      1100        57.6      1980
   29.0       648        40.7      1130        59.2      2310
   30.3       587        42.9      1270        59.8      1940
   32.7       704        45.8      1180        66.0      3260
   35.6       979        46.9      1400        67.4      2700
   38.5       914        48.2      1760        68.8      2890
   38.8      1070        51.5      1710        69.1      2740
   39.3      1020        51.5      2010        69.1      3140


Source: E.J. Williams. Regression analysis. John Wiley & Sons Inc., New 
York, 1959; Table 3.1 on page 43. 


how one can establish such a linear relationship by means of the observed 
pairs. 


QUICK EXERCISE 15.7 Suppose we have a eucalypt hardwood tree with den- 
sity 65. What would your prediction be for the corresponding Janka hardness? 


Fig. 15.12. Scatterplot of Janka hardness versus density of wood.




15.6 Solutions to the quick exercises 


15.1 There are 272 elements in the dataset. The 91st and 182nd elements 
of the ordered data divide the dataset in three groups, each consisting of 90 
elements. From a closer look at Table 15.2 we find that these two elements 
are 145 and 260. 


15.2 In Table 15.2 one can easily count the number of observations in each 
of the bins (90, 120], ..., (300, 330]. The heights on each bin can be computed
by dividing the number of observations in each bin by $272 \cdot 30 = 8160$. We get
the following:


Bin          Count   Height     Bin          Count   Height
(90, 120]       55   0.0067     (210, 240]      34   0.0042
(120, 150]      37   0.0045     (240, 270]      75   0.0092
(150, 180]       5   0.0006     (270, 300]      54   0.0066
(180, 210]       9   0.0011     (300, 330]       3   0.0004
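The heights in this table follow from the counts alone, which the following short sketch reproduces. The dictionary layout is our own; only the bin counts are taken from the table.

```python
# Counts per bin (left endpoint, right endpoint) -> number of observations.
counts = {
    (90, 120): 55, (120, 150): 37, (150, 180): 5, (180, 210): 9,
    (210, 240): 34, (240, 270): 75, (270, 300): 54, (300, 330): 3,
}

n = sum(counts.values())  # total number of observations

# Height on a bin = count / (n * bin width), so that the total area is 1.
heights = {
    (a, b): count / (n * (b - a)) for (a, b), count in counts.items()
}

for (a, b), height in heights.items():
    print(f"({a}, {b}]  {height:.4f}")

total_area = sum(height * (b - a) for (a, b), height in heights.items())
```

Dividing by $n$ times the bin width, rather than by $n$ alone, is what makes the total area under the histogram equal to one.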


15.3 From Table 15.2 we see that we must cover an interval of length of at
least $306 - 96 = 210$ with bins of width $b = 3.49 \cdot 68.48 \cdot 272^{-1/3} = 36.89$.
Since $210/36.89 = 5.69$, we need at least six bins to cover the whole dataset.


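15.4 By means of formula (15.1), we can write

\[ \int_{-\infty}^{\infty} f_{n,h}(t)\,dt = \frac{1}{nh} \sum_{i=1}^{n} \int_{-\infty}^{\infty} K\left(\frac{t - x_i}{h}\right) dt. \]

For any $i = 1, \ldots, n$, we find by change of integration variables $t = hu + x_i$
that

\[ \int_{-\infty}^{\infty} K\left(\frac{t - x_i}{h}\right) dt = h \int_{-\infty}^{\infty} K(u)\,du = h, \]

where we also use condition (K1). This directly yields

\[ \int_{-\infty}^{\infty} f_{n,h}(t)\,dt = \frac{1}{nh} \cdot n \cdot h = 1. \]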


15.5 The kernel density estimate will be strictly positive between the minimum
minus $h$ and the maximum plus $h$. The bandwidth equals
$h = 1.06 \cdot 68.48 \cdot 272^{-1/5} = 23.66$. From Table 15.2, we see that this will be between
$96 - 23.66 = 72.34$ and $306 + 23.66 = 329.66$.
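The arithmetic in solutions 15.3 and 15.5 can be reproduced as follows. This is a sketch under the assumption that 68.48 is the measure of spread used for the Old Faithful data (presumably the sample standard deviation); the function names are ours.

```python
import math


def bin_width(s, n):
    # Bin width used in solution 15.3: b = 3.49 * s * n^(-1/3).
    return 3.49 * s * n ** (-1 / 3)


def bandwidth(s, n):
    # Bandwidth used in solution 15.5: h = 1.06 * s * n^(-1/5).
    return 1.06 * s * n ** (-1 / 5)


s, n = 68.48, 272          # values quoted in the solutions
minimum, maximum = 96, 306  # smallest and largest observation

b = bin_width(s, n)
h = bandwidth(s, n)
number_of_bins = math.ceil((maximum - minimum) / b)

print(round(b, 2), number_of_bins)
print(round(h, 2), round(minimum - h, 2), round(maximum + h, 2))
```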


15.6 By definition the number of elements less than or equal to 1.5 is
$F_{300}(1.5) \cdot 300 = 210$. Hence 90 elements are strictly greater than 1.5.


15.7 Just by drawing a straight line that seems to fit the datapoints well, the 
authors predicted a Janka hardness of about 2700. 
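Instead of drawing a line by eye, one could also fit a line by least squares, a technique that is not used at this point in the text and is swapped in here only for illustration. The sketch below uses the data of Table 15.5; the helper `least_squares_line` is a hypothetical name.

```python
density = [
    24.7, 24.8, 27.3, 28.4, 28.4, 29.0, 30.3, 32.7, 35.6, 38.5, 38.8, 39.3,
    39.4, 39.9, 40.3, 40.6, 40.7, 40.7, 42.9, 45.8, 46.9, 48.2, 51.5, 51.5,
    53.4, 56.0, 56.5, 57.3, 57.6, 59.2, 59.8, 66.0, 67.4, 68.8, 69.1, 69.1,
]
hardness = [
    484, 427, 413, 517, 549, 648, 587, 704, 979, 914, 1070, 1020,
    1210, 989, 1160, 1010, 1100, 1130, 1270, 1180, 1400, 1760, 1710, 2010,
    1880, 1980, 1820, 2020, 1980, 2310, 1940, 3260, 2700, 2890, 2740, 3140,
]


def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares line through (x_i, y_i)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = least_squares_line(density, hardness)
prediction = intercept + slope * 65  # predicted hardness at density 65
print(round(prediction))
```

A fitted line and a line drawn by eye need not give exactly the same prediction, so some difference from the value quoted above is to be expected.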




15.7 Exercises 


15.1 In [33] Stephen Stigler discusses data from the Edinburgh Medical and 
Surgical Journal (1817). These concern the chest circumference of 5732 Scot- 
tish soldiers, measured in inches. The following information is given about the 
histogram with bin width 1, the first bin starting at 32.5. 


Bin             Count     Bin             Count

(32.5, 33.5]        3     (40.5, 41.5]      935
(33.5, 34.5]       19     (41.5, 42.5]      646
(34.5, 35.5]       81     (42.5, 43.5]      313
(35.5, 36.5]      189     (43.5, 44.5]      168
(36.5, 37.5]      409     (44.5, 45.5]       50
(37.5, 38.5]      753     (45.5, 46.5]       18
(38.5, 39.5]     1062     (46.5, 47.5]        3
(39.5, 40.5]     1082     (47.5, 48.5]        1


Source: S.M. Stigler. The history of statistics: The measurement of uncertainty
before 1900. Cambridge, Massachusetts, 1986.


a. Compute the height of the histogram on each bin. 


b. Make a sketch of the histogram. Would you view the dataset as being 
symmetric or skewed? 


15.2 Recall the example of the space shuttle Challenger in Section 1.4. The 
following list contains the launch temperatures in degrees Fahrenheit during 
previous takeoffs. 


66 70 69 68 67 72 73 70 57 63 70 78 
67 53 67 75 70 81 76 79 75 76 58 


Source: Presidential commission on the space shuttle Challenger accident. 
Report on the space shuttle Challenger accident. Washington, DC, 1986; table 
on pages 129-131. 


a. Compute the heights of a histogram with bin width 5, the first bin starting 
at 50. 

b. On January 28, 1986, during the launch of the space shuttle Challenger, 
the temperature was 31 degrees Fahrenheit. Given the dataset of launch 
temperatures of previous takeoffs, would you consider 31 as a representa- 
tive launch temperature? 


15.3 In an article in Biometrika, an example is discussed about mine disasters
during the period from March 15, 1851, to March 22, 1962. A dataset
has been obtained of 190 recorded time intervals (in days) between successive 
coal mine disasters involving ten or more men killed. The ordered data are 
listed in Table 15.6. 




Table 15.6. Number of days between successive coal mine disasters. 


   0     1     1     2     2     3     4     4     4     6
   7    10    11    12    12    12    13    15    15    16
  16    16    17    17    18    19    19    19    20    20
  22    23    24    25    27    28    29    29    29    31
  31    32    33    34    34    36    36    37    40    41
  41    42    43    45    47    48    49    50    53    54
  54    55    56    59    59    61    61    65    66    66
  70    72    75    78    78    78    80    80    81    88
  91    92    93    93    95    95    96    96    97    99
 101   108   110   112   113   114   120   120   123   123
 124   124   125   127   129   131   134   137   139   143
 144   145   151   154   156   157   176   182   186   187
 188   189   190   193   194   197   202   203   208   215
 216   217   217   217   218   224   225   228   232   233
 250   255   275   275   275   276   286   292   307   307
 312   312   315   324   326   326   329   330   336   345
 348   354   361   364   368   378   388   420   431   456
 462   467   498   517   536   538   566   632   644   745
 806   826   871   952  1205  1312  1358  1630  1643  2366


Source: R.G. Jarrett. A note on the intervals between coal mining disasters. 
Biometrika, 66:191-193, 1979; by permission of the Biometrika Trustees. 


a. Compute the height on each bin of the histogram with bins [0,250], 
(250, 500], ..., (2250, 2500]. 

b. Make a sketch of the histogram. Would you view the dataset as being 
symmetric or skewed? 


15.4 The ordered software data (see also Table 15.3) are given in the following
list.


   0     0     0     2     4     6     8     9    10    10
  10    12    15    15    16    21    22    24    26    30
  30    31    33    36    44    50    55    58    65    68
  75    77    79    81    88    91    97   100   108   108
 112   113   114   115   120   122   129   134   138   148
 148   160   176   180   193   193   197   227   232   233
 236   242   245   255   261   263   281   290   296   300
 300   325   330   357   365   369   371   379   386   422
 445   446   447   452   457   482   529   529   543   600
 648   670   700   707   724   729   748   790   810   816
 828   843   860   865   868   875   943   948   983   990
1011  1045  1064  1071  1082  1146  1160  1222  1247  1351
1435  1461  1755  1783  1800  1864  1897  2323  2930  3110
3321  4116  5485  5509  6150




a. Compute the heights on each bin of the histogram with bins [0, 500], 
(500, 1000], and so on. 

b. Compute the value of the empirical distribution function in the endpoints 
of the bins. 

c. Check that the area under the histogram on bin (1000, 1500] is equal to
the increase $F_n(1500) - F_n(1000)$ of the empirical distribution function
on this bin. Actually, this is true for each single bin (see Exercise 15.11).


15.5 Suppose we construct a histogram with bins [0, 1], (1, 3], (3, 5], (5, 8],
(8, 11], (11, 14], and (14, 18]. Given are the values of the empirical distribution
function at the boundaries of the bins:


t           0      1      3      5      8      11     14     18
$F_n(t)$    0      0.225  0.445  0.615  0.735  0.805  0.910  1.000


Compute the height of the histogram on each bin. 


15.6 Given is the following information about a histogram:


Bin        Height

(0, 2]     0.245
(2, 4]     0.130
(4, 7]     0.050
(7, 11]    0.020
(11, 15]   0.005


Compute the value of the empirical distribution function in the point t = 7. 


15.7 In Exercise 15.2 a histogram was constructed for the Challenger data. On 
which bin does the empirical distribution function have the largest increase? 


15.8 Define a function $K$ by

\[ K(u) = \cos(\pi u) \quad \text{for } -1 \le u \le 1 \]


and K(u) = 0 elsewhere. Check whether K satisfies the conditions (K1)—(K3) 
for a kernel function. 


15.9 On the basis of the duration of an eruption of the Old Faithful geyser, 
park rangers try to predict the waiting time to the next eruption. In Fig- 
ure 15.13 a scatterplot is displayed of the duration and the time to the next 
eruption in seconds. 


a. Does the scatterplot give reason to believe that the duration of an eruption 
influences the time to the next eruption? 


Fig. 15.13. Scatterplot of the Old Faithful data (horizontal axis: duration).


b. Suppose you have just observed an eruption that lasted 250 seconds. What 
would you predict for the time to the next eruption? 

c. The dataset of durations shows two modes, i.e., there are two places where 
the data accumulate (see, for instance, the histogram in Figure 15.1). How 
many modes does the dataset of waiting times show? 


15.10 Figure 15.14 displays the graph of an empirical distribution function 
of a dataset consisting of 200 elements. How many modes does the dataset
show?


Fig. 15.14. Empirical distribution function.


15.11 Given is a histogram and the empirical distribution function $F_n$ of
the same dataset. Show that the height of the histogram on a bin $(a, b]$ is
equal to

\[ \frac{F_n(b) - F_n(a)}{b - a}. \]


15.12 Let $f_{n,h}$ be a kernel estimate. As mentioned in Section 15.3, $f_{n,h}$
itself is a probability density.


a. Show that the corresponding expectation is equal to

\[ \int_{-\infty}^{\infty} t f_{n,h}(t)\,dt = \bar{x}_n. \]

Hint: you might consult the solution to Quick exercise 15.4.

b. Show that the second moment corresponding to $f_{n,h}$ satisfies

\[ \int_{-\infty}^{\infty} t^2 f_{n,h}(t)\,dt = \frac{1}{n} \sum_{i=1}^{n} x_i^2 + h^2 \int_{-\infty}^{\infty} u^2 K(u)\,du. \]


16 


Exploratory data analysis: numerical 
summaries 


The classical way to describe important features of a dataset is to give several 
numerical summaries. We discuss numerical summaries for the center of a 
dataset and for the amount of variability among the elements of a dataset, and 
then we introduce the notion of quantiles for a dataset. To distinguish these 
quantities from corresponding notions for probability distributions of random 
variables, we will often add the word sample or empirical; for instance, we will 
speak of the sample mean and empirical quantiles. We end this chapter with 
the boxplot, which combines some of the numerical summaries in a graphical 
display. 


16.1 The center of a dataset 


The best-known method to identify the center of a dataset is to compute the
sample mean

\[ \bar{x}_n = \frac{x_1 + x_2 + \cdots + x_n}{n}. \tag{16.1} \]


For the sake of notational convenience we will sometimes drop the subscript n 
and write $\bar{x}$ instead of $\bar{x}_n$. The following dataset consists of hourly temperatures
in degrees Fahrenheit (rounded to the nearest integer), recorded at Wick
in northern Scotland from 5 p.m. December 31, 1960, to 3 a.m. January 1, 
1961. The sample mean of the 11 measurements is equal to 44.7. 


43 43 41 41 41 42 43 58 58 41 41 


Source: V. Barnett and T. Lewis. Outliers in statistical data. Third edition, 
1994. © John Wiley & Sons Limited. Reproduced with permission. 


Another way to identify the center of a dataset is by means of the sample 
median, which we will denote by $\mathrm{Med}(x_1, x_2, \ldots, x_n)$ or briefly $\mathrm{Med}_n$. The
sample median is defined as the middle element of the dataset when it is put
in ascending order. When $n$ is odd, it is clear what this means. When $n$ is even,
we take the average of the two middle elements. For the Wick temperature
data the sample median is equal to 42.


QUICK EXERCISE 16.1 Compute the sample mean and sample median of the 
dataset 


4.6 3.0 3.2 4.2 5.0. 


Both methods have pros and cons. The sample mean is the natural analogue 
for a dataset of what the expectation is for a probability distribution. However, 
it is very sensitive to outliers, by which we mean observations in the dataset
that deviate a lot from the bulk of the data. 


To illustrate the sensitivity of the sample mean, consider the Wick tempera- 
ture data displayed in Figure 16.1. The values 58 and 58 recorded at midnight 
and 1 a.m. are clearly far from the bulk of the data and give grounds for 
concern whether they are genuine (58 degrees Fahrenheit seems very warm 
at midnight for New Year’s in northern Scotland). To investigate their effect 
on the sample mean we compute the average of the data, leaving out these 
measurements, which gives 41.8 (instead of 44.7). The sample median of the 
data is equal to 41 (instead of 42) when leaving out the measurements with 
value 58. The median is more robust in the sense that it is hardly affected by 
a few outliers. 
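The effect of the two values 58 on the sample mean and the sample median can be reproduced with Python's standard `statistics` module. The variable names below are our own.

```python
import statistics

# Hourly Wick temperatures as originally recorded (degrees Fahrenheit).
wick = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]

# The same data with the two suspicious measurements left out.
wick_without_58 = [x for x in wick if x != 58]

mean_all = statistics.mean(wick)
median_all = statistics.median(wick)
mean_trimmed = statistics.mean(wick_without_58)
median_trimmed = statistics.median(wick_without_58)

print(round(mean_all, 1), median_all)
print(round(mean_trimmed, 1), median_trimmed)
```

The mean moves by almost three degrees when the two values are left out, whereas the median moves by one.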


Fig. 16.1. The Wick temperature data (temperature against time of day).


It should be emphasized that this discussion is only meant to illustrate the 
sensitivity of the sample mean and by no means is intended to suggest we leave 
out measurements that deviate a lot from the bulk of the data! It is important 
to be aware of the presence of an outlier. In that case, one could try to find out 
whether there is perhaps something suspicious about this measurement. This 
might lead to assigning a smaller weight to such a measurement or even to
removing it from the dataset. However, sometimes it is possible to reconstruct
the exact circumstances and correct the measurement. For instance, after 
further inquiry in the temperature example it turned out that at midnight 
the meteorological office changed its recording unit from degrees Fahrenheit 
to 1/10th degree Celsius (so 58 and 41 should read 5.8°C and 4.1°C). The 
corrected values in degrees Fahrenheit (to the nearest integer) are 


43 43 41 41 41 42 43 42 42 39 39. 
For the corrected data the sample mean is 41.5 and the sample median is 42. 


QUICK EXERCISE 16.2 Consider the same dataset as in Quick exercise 16.1. 
Suppose that someone misreads the dataset as 


4.6 30 3.2 4.2 50. 


Compute the sample mean and sample median and compare these values with 
the ones you found in Quick exercise 16.1. 
16.2 The amount of variability of a dataset 


To quantify the amount of variability among the elements of a dataset, one 
often uses the sample variance defined by

\[ s_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2. \]

Up to a scaling factor this is equal to the average squared deviation from $\bar{x}_n$.
At first sight, it seems more natural to define the sample variance by

\[ \tilde{s}_n^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2. \]

Why we choose the factor $1/(n-1)$ instead of $1/n$ will be explained later (see
Chapter 19). Because $s_n^2$ is in different units from the elements of the dataset,
one often prefers the sample standard deviation

\[ s_n = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2}, \]

which is measured in the same units as the elements of the dataset itself.

Just as the sample mean, the sample standard deviation is very sensitive to 
outliers. For the (uncorrected) Wick temperature data the sample standard 
deviation is 6.62, or 0.97 if we leave out the two measurements with value 58.
For the corrected data the standard deviation is 1.44. A more robust measure
of variability is the median of absolute deviations or MAD, which is defined 
as follows. Consider the absolute deviation of every element $x_i$ with respect
to the sample median:

\[ |x_i - \mathrm{Med}(x_1, x_2, \ldots, x_n)| \]

or briefly $|x_i - \mathrm{Med}_n|$.

The MAD is obtained by taking the median of all these absolute deviations:

\[ \mathrm{MAD}(x_1, x_2, \ldots, x_n) = \mathrm{Med}(|x_1 - \mathrm{Med}_n|, \ldots, |x_n - \mathrm{Med}_n|). \tag{16.2} \]


QUICK EXERCISE 16.3 Compute the sample standard deviation for the dataset 
of Quick exercise 16.1 for which it is given that the values of $x_i - \bar{x}_n$ are:


-1.0, 0.6, -0.8, 0.2, 1.0.

Also compute the MAD for this dataset.


Just as the sample median, the MAD is hardly affected by outliers. For the 
(uncorrected) Wick temperature data the MAD is 1 and equal to 0 if we leave 
out the two measurements with value 58 (the value 0 seems a bit strange, 
but is a consequence of the fact that the observations are given in degrees 
Fahrenheit rounded to the nearest integer). For the corrected data the MAD 
is 1. 
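The values of the sample standard deviation and the MAD quoted for the three versions of the Wick data can be checked as follows. `statistics.stdev` uses the factor $1/(n-1)$; the helper `mad` is our own and simply follows the definition of the MAD given above.

```python
import statistics


def mad(data):
    # Median of the absolute deviations from the sample median.
    med = statistics.median(data)
    return statistics.median(abs(x - med) for x in data)


uncorrected = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]
without_58 = [x for x in uncorrected if x != 58]
corrected = [43, 43, 41, 41, 41, 42, 43, 42, 42, 39, 39]

for name, data in [
    ("uncorrected", uncorrected),
    ("without 58", without_58),
    ("corrected", corrected),
]:
    print(name, round(statistics.stdev(data), 2), mad(data))
```

The standard deviation changes by a factor of almost seven between the first two datasets, while the MAD only moves between 0 and 1.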


QUICK EXERCISE 16.4 Compute the sample standard deviation for the misread
dataset of Quick exercise 16.2 for which it is given that the values of
$x_i - \bar{x}_n$ are:


11.6, -13.8, -15.2, -14.2, 31.6.


Also compute the MAD for this dataset and compare both values with the 
ones you found in Quick exercise 16.3. 


16.3 Empirical quantiles, quartiles, and the IQR 


The sample median divides the dataset in two more or less equal parts: about 
half of the elements are less than the median and about half of the elements 
are greater than the median. More generally, we can divide the dataset in 
two parts in such a way that a proportion p is less than a certain number 
and a proportion 1 — p is greater than this number. Such a number is called 
the 100p empirical percentile or the $p$th empirical quantile and is denoted by
$q_n(p)$. For a suitable introduction of empirical quantiles we need the notion
of order statistics.




The order statistics consist of the same elements as in the original dataset 
$x_1, x_2, \ldots, x_n$, but in ascending order. Denote by $x_{(k)}$ the $k$th element in the
ordered list. Then

\[ x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)} \]

are called the order statistics of $x_1, x_2, \ldots, x_n$. The order statistics of the Wick
temperature data are


41 41 41 41 41 42 43 43 43 58 58. 


Note that by putting the elements in order, it is possible that successive order 
statistics are the same, for instance, $x_{(1)} = \cdots = x_{(5)} = 41$. Another example
is Table 15.2, which lists the order statistics of the Old Faithful dataset. 


To compute empirical quantiles one linearly interpolates between order statistics
of the dataset. Let $0 < p < 1$, and suppose we want to compute the $p$th
empirical quantile for a dataset $x_1, x_2, \ldots, x_n$. The following computation is
based on requiring that the $i$th order statistic is the $i/(n+1)$ quantile. If we
denote the integer part of $a$ by $\lfloor a \rfloor$, then the computation of $q_n(p)$ runs as
follows:

\[ q_n(p) = x_{(k)} + \alpha \left( x_{(k+1)} - x_{(k)} \right) \]

with $k = \lfloor p(n+1) \rfloor$ and $\alpha = p(n+1) - k$. On the left in Figure 16.2 the
relation between the $p$th quantile and the empirical distribution function is
illustrated for the Old Faithful data.
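The interpolation rule translates directly into code. The sketch below is a minimal version; the function name and the decision to raise an error when $p(n+1)$ falls outside the range of the order statistics are our own assumptions, since the text does not say what to do in that case.

```python
import math


def empirical_quantile(data, p):
    """p-th empirical quantile by linear interpolation between order statistics."""
    xs = sorted(data)          # order statistics x_(1) <= ... <= x_(n)
    n = len(xs)
    pos = p * (n + 1)
    k = math.floor(pos)        # integer part of p(n+1)
    alpha = pos - k            # fractional part of p(n+1)
    if k < 1 or k > n or (k == n and alpha > 0):
        # Assumption: outside this range the rule is left undefined.
        raise ValueError("p(n+1) must lie between 1 and n")
    if alpha == 0:
        return xs[k - 1]
    # xs is 0-indexed, so x_(k) is xs[k - 1] and x_(k+1) is xs[k].
    return xs[k - 1] + alpha * (xs[k] - xs[k - 1])


wick = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]
print(empirical_quantile(wick, 0.25))
print(empirical_quantile(wick, 0.50))
print(empirical_quantile(wick, 0.75))
```

For $p = 0.5$ and odd $n$ the rule returns the middle order statistic, so it agrees with the sample median.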


Fig. 16.2. Empirical quantile (left) and quartiles (right) for the Old Faithful data.


QUICK EXERCISE 16.5 Compute the 55th empirical percentile for the Wick 
temperature data. 




Lower and upper quartiles 


Instead of identifying only the center of the dataset, Tukey [35] suggested 
to give a five-number summary of the dataset: the minimum, the maximum, 
the sample median, and the 25th and 75th empirical percentiles. The 25th 
empirical percentile $q_n(0.25)$ is called the lower quartile and the 75th empirical
percentile $q_n(0.75)$ is called the upper quartile. Together with the median, the
lower and upper quartiles divide the dataset in four more or less equal parts 
consisting of about one quarter of the number of elements. The relation of 
the two quartiles and the median with the empirical distribution function is 
illustrated for the Old Faithful data on the right of Figure 16.2. The distance 
between the lower quartile and the median, relative to the distance between 
the upper quartile and the median, gives some indication on the skewness of 
the dataset. The distance between the upper and lower quartiles is called the 
interquartile range, or IQR: 


\[ \mathrm{IQR} = q_n(0.75) - q_n(0.25). \]


The IQR specifies the range of the middle half of the dataset. It could also 
serve as a robust measure of the amount of variability among the elements of 
the dataset. For the Old Faithful data the five-number summary is 


Minimum Lower quartile Median Upper quartile Maximum 
96 129.25 240 267.75 306 


and the IQR is 138.5. 


QUICK EXERCISE 16.6 Compute the five-number summary for the (uncor- 
rected) Wick temperature data. 


16.4 The box-and-whisker plot 


Tukey [35] also proposed visualizing the five-number summary discussed in 
the previous section by a so-called box-and-whisker plot, briefly boxplot. Fig- 
ure 16.3 displays a boxplot. The data are now on the vertical axis, where we 
left out the numbers on the axis in order to explain the construction of the 
figure. The horizontal width of the box is irrelevant. In the vertical direction 
the box extends from the lower to the upper quartile, so that the height of the 
box is precisely the IQR. The horizontal line inside the box corresponds to the 
sample median. Up from the upper quartile we measure out a distance of 1.5 
times the IQR and draw a so-called whisker up to the largest observation that 
lies within this distance, where we put a horizontal line. Similarly, down from 
the lower quartile we measure out a distance of 1.5 times the IQR and draw 
a whisker to the smallest observation that lies within this distance, where 
we also put a horizontal line. All other observations beyond the whiskers are 
marked by o. Such an observation is called an outlier. 
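The construction just described can be turned into a small computation of the box, the whisker ends, and the outliers. This is a sketch only: it restates the quantile rule of Section 16.3 so that the block is self-contained, assumes that $1 \le p(n+1) < n$ for both quartiles, and uses function and key names of our own.

```python
import math


def empirical_quantile(xs_sorted, p):
    # Interpolation rule of Section 16.3; assumes 1 <= p(n+1) < n.
    pos = p * (len(xs_sorted) + 1)
    k = math.floor(pos)
    alpha = pos - k
    lower = xs_sorted[k - 1]
    return lower + alpha * (xs_sorted[k] - lower)


def boxplot_summary(data):
    xs = sorted(data)
    q1 = empirical_quantile(xs, 0.25)
    q3 = empirical_quantile(xs, 0.75)
    iqr = q3 - q1
    low_limit = q1 - 1.5 * iqr
    high_limit = q3 + 1.5 * iqr
    # Observations within 1.5 times the IQR of the quartiles.
    inside = [x for x in xs if low_limit <= x <= high_limit]
    return {
        "lower_quartile": q1,
        "upper_quartile": q3,
        "iqr": iqr,
        "lower_whisker_end": min(inside),
        "upper_whisker_end": max(inside),
        "outliers": [x for x in xs if x < low_limit or x > high_limit],
    }


wick = [43, 43, 41, 41, 41, 42, 43, 58, 58, 41, 41]
summary = boxplot_summary(wick)
print(summary)
```

Note that the whiskers end at actual observations, not at the quartiles plus or minus 1.5 times the IQR, so a whisker can have length zero.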


Fig. 16.3. A boxplot. Labels marked in the figure: upper quartile + 1.5·IQR, maximum, upper quartile, median, lower quartile, lower quartile − 1.5·IQR, minimum.


In Figure 16.4 the boxplots of the Old Faithful data and of the software relia- 
bility data (see also Chapter 15) are displayed. The skewness of the software 
reliability data produces a boxplot with whiskers of very different length and 
with several observations beyond the upper quartile plus 1.5 times the IQR. 
The boxplot of the Old Faithful data illustrates one of the shortcomings of the 
boxplot; it does not capture the fact that the data show two separate peaks. 
However, the position of the sample median inside the box does suggest that 
the dataset is skewed. 


QUICK EXERCISE 16.7 Suppose we want to construct a boxplot of the (uncor- 
rected) Wick temperature data. What is the height of the box, the length of 
both whiskers, and which measurements fall outside the box and whiskers? 
Would you consider the two values 58 extreme outliers? 


Fig. 16.4. Boxplot of the Old Faithful data (left) and the software data (right).




Using boxplots to compare several datasets 


Although the boxplot provides some information about the structure of the 
data, such as center, range, skewness or symmetry, it is a poor graphical 
display of the dataset. Graphical summaries such as the histogram and kernel 
density estimate are more informative displays of a single dataset. Boxplots 
become useful if we want to compare several sets of data in a simple graphical 
display. In Figure 16.5 boxplots are displayed of the average drill time for 
dry and wet drilling up to a depth of 250 feet for the drill data discussed in 
Section 15.5 (see also Table 15.4). It is clear that the boxplot corresponding 
to dry drilling differs from that corresponding to wet drilling. However, the 
question is whether this difference can still be attributed to chance or is caused 
by the drilling technique used. We will return to this type of question in 
Chapter 25. 


Fig. 16.5. Boxplot of average drill times (dry and wet drilling).


16.5 Solutions to the quick exercises 


16.1 The average is

\[ \bar{x}_n = \frac{4.6 + 3.0 + 3.2 + 4.2 + 5.0}{5} = \frac{20}{5} = 4. \]

The median is the middle element of 3.0, 3.2, 4.2, 4.6, and 5.0, which gives
$\mathrm{Med}_n = 4.2$.


16.2 The average is

\[ \bar{x}_n = \frac{4.6 + 30 + 3.2 + 4.2 + 50}{5} = \frac{92}{5} = 18.4, \]

which differs 14.4 from the average we found in Quick exercise 16.1. The
median is the middle element of 3.2, 4.2, 4.6, 30, and 50. This gives $\mathrm{Med}_n = 4.6$,
which only differs 0.4 from the median we found in Quick exercise 16.1.
As one can see, the median is hardly affected by the two outliers.


16.3 The sample variance is

\[ s_n^2 = \frac{(-1.0)^2 + (0.6)^2 + (-0.8)^2 + (0.2)^2 + (1.0)^2}{5-1} = \frac{3.04}{4} = 0.76, \]

so that the sample standard deviation is $s_n = \sqrt{0.76} = 0.872$. The median
is 4.2, so that the absolute deviations from the median are given by

0.4 1.2 1.0 0.0 0.8.

The MAD is the median of these numbers, which is 0.8.


16.4 The sample variance is

\[ s_n^2 = \frac{(11.6)^2 + (-13.8)^2 + (-15.2)^2 + (-14.2)^2 + (31.6)^2}{5-1} = \frac{1756.24}{4} = 439.06, \]

so that the sample standard deviation is $s_n = \sqrt{439.06} = 20.95$, which
differs by 20.08 from the value 0.872 we found in Quick exercise 16.3. The
median is 4.6, so that the absolute deviations from the median are given by

0.0 25.4 1.4 0.4 45.4.

The MAD is the median of these numbers, which is 1.4. Just like the median,
the MAD is hardly affected by the two outliers.
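
The numbers in the solutions to Quick exercises 16.1 to 16.4 can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `mad` and `summarize` are our own, and `statistics.variance` is used because it divides by n - 1, which matches the sample variance used above.

```python
import statistics


def mad(data):
    """Median of absolute deviations (MAD) from the sample median."""
    med = statistics.median(data)
    return statistics.median([abs(x - med) for x in data])


def summarize(data):
    """Return summaries of center and spread for a dataset."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "variance": statistics.variance(data),  # divides by n - 1
        "stdev": statistics.stdev(data),
        "mad": mad(data),
    }


original = [4.6, 3.0, 3.2, 4.2, 5.0]      # dataset of Quick exercise 16.1
with_outliers = [4.6, 30, 3.2, 4.2, 50]   # dataset of Quick exercise 16.2

for name, data in [("original", original), ("with outliers", with_outliers)]:
    summary = summarize(data)
    print(name, {key: round(value, 3) for key, value in summary.items()})
```

Comparing the two printed lines shows the pattern discussed above: the mean and standard deviation change drastically, whereas the median and the MAD barely move.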


16.5 We have $k = \lfloor 0.55 \cdot 12 \rfloor = \lfloor 6.6 \rfloor = 6$, so that $\alpha = 0.6$. This gives

\[ q_n(0.55) = x_{(6)} + 0.6 \cdot (x_{(7)} - x_{(6)}) = 42 + 0.6 \cdot (43 - 42) = 42.6. \]

16.6 From the order statistics of the Wick temperature data 
41 41 41 41 41 42 43 43 43 58 58 


it can be seen immediately that minimum, maximum, and median are given by
41, 58, and 42. For the lower quartile we have $k = \lfloor 0.25 \cdot 12 \rfloor = 3$,
so that $\alpha = 0$ and $q_n(0.25) = x_{(3)} = 41$. For the upper quartile we have
$k = \lfloor 0.75 \cdot 12 \rfloor = 9$, so that again $\alpha = 0$ and
$q_n(0.75) = x_{(9)} = 43$. Hence for the Wick temperature data the five-number
summary is


Minimum Lower quartile Median Upper quartile Maximum 
41 41 42 43 58 
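
The quantile computations of Quick exercises 16.5 and 16.6 follow one recipe: with $k = \lfloor p(n+1) \rfloor$ and $\alpha = p(n+1) - k$, take $q_n(p) = x_{(k)} + \alpha (x_{(k+1)} - x_{(k)})$. A minimal Python sketch of this recipe is given below; the function name `empirical_quantile` and the decision to raise an error when $p$ falls outside the range where the formula is defined are our own choices.

```python
import math


def empirical_quantile(data, p):
    """pth empirical quantile: x_(k) + alpha * (x_(k+1) - x_(k)),
    where k = floor(p * (n + 1)) and alpha = p * (n + 1) - k."""
    xs = sorted(data)                 # order statistics x_(1), ..., x_(n)
    n = len(xs)
    position = p * (n + 1)
    k = math.floor(position)
    alpha = position - k
    if k < 1 or k > n or (k == n and alpha > 0):
        raise ValueError("p must lie between 1/(n+1) and n/(n+1)")
    if alpha == 0:
        return xs[k - 1]              # lists are 0-based, so x_(k) is xs[k - 1]
    return xs[k - 1] + alpha * (xs[k] - xs[k - 1])


wick = [41, 41, 41, 41, 41, 42, 43, 43, 43, 58, 58]

five_number_summary = (
    min(wick),
    empirical_quantile(wick, 0.25),
    empirical_quantile(wick, 0.50),
    empirical_quantile(wick, 0.75),
    max(wick),
)
print(five_number_summary)
print(round(empirical_quantile(wick, 0.55), 4))
```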




16.7 From the five-number summary for the Wick temperature data (see 
Quick exercise 16.6), it follows immediately that the height of the box is the 
IQR: 43 — 41 = 2. If we measure out a distance of 1.5 times 2 down from the 
lower quartile 41, we see that the smallest observation within this range is 
41, which means that the lower whisker has length zero. Similarly, the upper 
whisker has length zero. The two measurements with value 58 are outside the 
box and whiskers. The two values 58 are clearly far away from the bulk of the 
data and should be considered extreme outliers. 
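
The reasoning in this solution can be traced step by step in Python. The sketch below is illustrative: the function name `boxplot_parts` is ours, the quartiles 41 and 43 are taken from Quick exercise 16.6, and flagging points more than 3 times the IQR beyond the quartiles as extreme is a common convention that we assume here.

```python
def boxplot_parts(data, lower_quartile, upper_quartile):
    """Box height, whisker lengths, and points outside box and whiskers."""
    iqr = upper_quartile - lower_quartile
    # Whiskers reach the most extreme observations within 1.5 * IQR of the box.
    low_limit = lower_quartile - 1.5 * iqr
    high_limit = upper_quartile + 1.5 * iqr
    within = [x for x in data if low_limit <= x <= high_limit]
    lower_end, upper_end = min(within), max(within)
    outside = [x for x in data if x < lower_end or x > upper_end]
    # Assumed convention: more than 3 * IQR beyond a quartile counts as extreme.
    extreme = [x for x in data
               if x < lower_quartile - 3 * iqr or x > upper_quartile + 3 * iqr]
    return {
        "box_height": iqr,
        "lower_whisker": lower_quartile - lower_end,
        "upper_whisker": upper_end - upper_quartile,
        "outside": outside,
        "extreme": extreme,
    }


wick = [41, 41, 41, 41, 41, 42, 43, 43, 43, 58, 58]
parts = boxplot_parts(wick, 41, 43)
print(parts)
```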


(Boxplot of the Wick temperature data, with tick marks at 41, 43, and 58.)


16.6 Exercises 


16.1 Use the order statistics of the software data as given in Exercise 15.4
to answer the following questions. 


a. Compute the sample median. 
b. Compute the lower and upper quartiles and the IQR. 
c. Compute the 37th empirical percentile. 


16.2 Compute for the Old Faithful data the distance of the lower and upper 
quartiles to the median and explain the difference. 


16.3 Recall the example about the space shuttle Challenger in Section 1.4. 
The following table lists the order statistics of launch temperatures during 
take-offs in degrees Fahrenheit, including the launch temperature on Jan- 
uary 28, 1986. 


31 53 57 58 63 66 67 67 67 68 69 70 
70 70 70 72 73 75 75 76 76 78 79 81 


a. Find the sample median and the lower and upper quartiles. 
b. Sketch the boxplot of this dataset. 




c. On January 28, 1986, the launch temperature was 31 degrees Fahrenheit. 
Comment on the value 31 with respect to the other data points. 


16.4 The sample mean and sample median of the uncorrected Wick temperature
data (in degrees Fahrenheit) are 44.7 and 42. We transform the data
from degrees Fahrenheit ($x_i$) to degrees Celsius ($y_i$) by means of the formula

\[ y_i = \frac{5}{9}(x_i - 32), \]

which gives the following dataset

\[ \tfrac{55}{9} \quad \tfrac{55}{9} \quad 5 \quad 5 \quad 5 \quad \tfrac{50}{9} \quad \tfrac{55}{9} \quad \tfrac{130}{9} \quad \tfrac{130}{9} \quad 5 \quad 5. \]

a. Check that $\bar{y}_n = \frac{5}{9}(\bar{x}_n - 32)$.

b. Is it also true that $\mathrm{Med}(y_1, \ldots, y_n) = \frac{5}{9}(\mathrm{Med}(x_1, \ldots, x_n) - 32)$?

c. Suppose we have a dataset $x_1, x_2, \ldots, x_n$ and construct $y_1, y_2, \ldots, y_n$
where $y_i = a x_i + b$ with $a$ and $b$ being real numbers. Do similar relations
hold for the sample mean and sample median? If so, state them.


16.5 Consider the uncorrected Wick temperature data in degrees Fahrenheit
($x_i$) and the corresponding temperatures in degrees Celsius ($y_i$) as given in
Exercise 16.4. The sample standard deviation and the MAD for the Wick data
are 6.62 and 1.


a. Let $s_F$ and $s_C$ denote the sample standard deviations of $x_1, x_2, \ldots, x_n$
and $y_1, y_2, \ldots, y_n$ respectively. Check that $s_C = \frac{5}{9} s_F$.


b. Let $\mathrm{MAD}_F$ and $\mathrm{MAD}_C$ denote the MAD of $x_1, x_2, \ldots, x_n$ and
$y_1, y_2, \ldots, y_n$ respectively. Is it also true that $\mathrm{MAD}_C = \frac{5}{9} \mathrm{MAD}_F$?


c. Suppose we have a dataset $x_1, x_2, \ldots, x_n$ and construct $y_1, y_2, \ldots, y_n$
where $y_i = a x_i + b$ with $a$ and $b$ being real numbers. Do similar relations
hold for the sample standard deviation and the MAD? If so, state them.


16.6 Consider two datasets: 1,5,9 and 2,4,6,8. 


a. Denote the sample means of the two datasets by $\bar{x}$ and $\bar{y}$. Is it true that the
average $(\bar{x} + \bar{y})/2$ of $\bar{x}$ and $\bar{y}$ is equal to the sample mean of the combined
dataset with 7 elements?


b. Suppose we have two other datasets: one of size $n$ with sample mean
$\bar{x}_n$, and another dataset of size $m$ with sample mean $\bar{y}_m$. Is it always
true that the average $(\bar{x}_n + \bar{y}_m)/2$ of $\bar{x}_n$ and $\bar{y}_m$ is equal to the sample
mean of the combined dataset with $n + m$ elements? If no, then provide
a counterexample. If yes, then explain this.


c. If $m = n$, is $(\bar{x}_n + \bar{y}_m)/2$ equal to the sample mean of the combined dataset
with $n + m$ elements?


16.7 Consider the two datasets from Exercise 16.6. 


a. Denote the sample medians of the two datasets by $\mathrm{Med}_x$ and $\mathrm{Med}_y$. Is it
true that the sample median $(\mathrm{Med}_x + \mathrm{Med}_y)/2$ of the two sample medians
is equal to the sample median of the combined dataset with 7 elements?


b. Suppose we have two other datasets: one of size $n$ with sample median
$\mathrm{Med}_x$ and another dataset of size $m$ with sample median $\mathrm{Med}_y$. Is it
always true that the sample median $(\mathrm{Med}_x + \mathrm{Med}_y)/2$ of the two sample
medians is equal to the sample median of the combined dataset with $n + m$
elements? If no, then provide a counterexample. If yes, then explain this.


c. What if $m = n$?


16.8 Compute the MAD for the combined dataset of 7 elements from
Exercise 16.6.


16.9 Consider a dataset $x_1, x_2, \ldots, x_n$ with $x_i \neq 0$. We construct a second
dataset $y_1, y_2, \ldots, y_n$, where

\[ y_i = \frac{1}{x_i}. \]

a. Suppose dataset $x_1, x_2, x_3$ consists of $-6, 1, 15$. Is it true that
$\bar{y}_3 = 1/\bar{x}_3$?

b. Suppose that $n$ is odd. Is it true that $\bar{y}_n = 1/\bar{x}_n$?


c. Suppose that $n$ is odd and each $x_i > 0$. Is it true that
$\mathrm{Med}(y_1, \ldots, y_n) = 1/\mathrm{Med}(x_1, \ldots, x_n)$? What about when $n$ is even?


16.10 A method to investigate the sensitivity of the sample mean and the
sample median to extreme outliers is to replace one or more elements in a
given dataset by a number $y$ and investigate the effect when $y$ goes to infinity.
To illustrate this, consider the dataset from Quick exercise 16.1:


4.6 3.0 3.2 4.2 5.0

with sample mean 4 and sample median 4.2.


a. We replace the element 3.2 by some real number $y$. What happens with
the sample mean and the sample median of this new dataset as $y \to \infty$?


b. We replace a number of elements by some real number $y$. How many
elements do we need to replace so that the sample median of the new
dataset goes to infinity as $y \to \infty$?


c. Suppose we have another dataset of size $n$. How many elements do we
need to replace by some real number $y$, so that the sample mean of the
new dataset goes to infinity as $y \to \infty$? And how many elements do we
need to replace, so that the sample median of the new dataset goes to
infinity?




16.11 Just as in Exercise 16.10 we investigate the sensitivity of the sample 
standard deviation and the MAD to extreme outliers, by considering the same 
dataset with sample standard deviation 0.872 and MAD equal to 0.8. Answer 
the same three questions for the sample standard deviation and the MAD 
instead of the sample mean and sample median. 


16.12 Compute the sample mean and sample median for the dataset

\[ 1, 2, \ldots, N \]

in case $N$ is odd and in case $N$ is even. You may use the fact that

\[ 1 + 2 + \cdots + N = \frac{N(N+1)}{2}. \]


16.13 Compute the sample standard deviation and MAD for the dataset

\[ -N, \ldots, -1, 0, 1, \ldots, N. \]

You may use the fact that

\[ 1^2 + 2^2 + \cdots + N^2 = \frac{N(N+1)(2N+1)}{6}. \]


16.14 Check that the 50th empirical percentile is the sample median. 


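16.15 The following rule is useful for the computation of the sample variance
(and standard deviation). Show that

\[ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - (\bar{x}_n)^2, \]

where $\bar{x}_n = \left( \sum_{i=1}^{n} x_i \right) / n$.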


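16.16 Recall Exercise 15.12, where we computed the mean and second moment
corresponding to a density estimate $f_{n,h}$. Show that the variance
corresponding to $f_{n,h}$ satisfies:

\[ \int_{-\infty}^{\infty} t^2 f_{n,h}(t)\,dt - \left( \int_{-\infty}^{\infty} t f_{n,h}(t)\,dt \right)^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2 + h^2 \int_{-\infty}^{\infty} u^2 K(u)\,du. \]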


16.17 Suppose we have a dataset $x_1, x_2, \ldots, x_n$. Check that if $p = i/(n + 1)$
the $p$th empirical quantile is the $i$th order statistic.


17 


Basic statistical models 


In this chapter we introduce a common statistical model. It corresponds to 
the situation where the elements of the dataset are repeated measurements 
of the same quantity and where different measurements do not influence each 
other. Next, we discuss the probability distribution of the random variables 
that model the measurements and illustrate how sample statistics can help 
to select a suitable statistical model. Finally, we discuss the simple linear 
regression model that corresponds to the situation where the elements of the 
dataset are paired measurements. 


17.1 Random samples and statistical models 


In Chapter 1 we briefly discussed Michelson’s experiment conducted between 
June 5 and July 2 in 1879, in which 100 measurements were obtained on the 
speed of light. The values are given in Table 17.1 and represent the speed 
of light in air in km/sec minus 299000. The variation among the 100 values 
suggests that measuring the speed of light is subject to random influences. As 
we have seen before, we describe random phenomena by means of a probability 
model, i.e., we interpret the outcome of an experiment as a realization of 
some random variable. Hence the first measurement is modeled by a random
variable $X_1$ and the value 850 is interpreted as the realization of $X_1$. Similarly,
the second measurement is modeled by a random variable $X_2$ and the value 740
is interpreted as the realization of $X_2$. Since both measurements are obtained
under the same experimental conditions, it is justified to assume that the
probability distributions of $X_1$ and $X_2$ are the same. More generally, the 100
measurements are modeled by random variables


\[ X_1, X_2, \ldots, X_{100} \]


with the same probability distribution, and the values in Table 17.1 are
interpreted as realizations of $X_1, X_2, \ldots, X_{100}$. Moreover, because we believe that




Table 17.1. Michelson data on the speed of light. 


850 740 900 1070 930 850 950 980 980 880 
1000 980 930 650 760 810 1000 1000 960 960 
960 940 960 940 880 800 850 880 900 840 
830 790 810 880 880 830 800 790 760 800 
880 880 880 860 720 720 620 860 970 950 
880 910 850 870 840 840 850 840 840 840 
890 810 810 820 800 770 760 740 750 760 
910 920 890 860 880 720 840 850 850 780 
890 840 780 810 760 810 790 810 820 850 
870 870 810 740 810 940 950 800 810 870 


Source: E.N. Dorsey. The velocity of light. Transactions of the American 
Philosophical Society. 34(1):1-110, 1944; Table 22 on pages 60-61. 


Michelson took great care not to have the measurements influence each other,
the random variables $X_1, X_2, \ldots, X_{100}$ are assumed to be mutually
independent (see also Remark 3.1 about physical and stochastic independence). Such
a collection of random variables is called a random sample or, briefly, a sample.


RANDOM SAMPLE. A random sample is a collection of random variables
$X_1, X_2, \ldots, X_n$ that have the same probability distribution
and are mutually independent.


If $F$ is the distribution function of each random variable $X_i$ in a random
sample, we speak of a random sample from $F$. Similarly, we speak of a random
sample from a density $f$, a random sample from an $N(\mu, \sigma^2)$ distribution, etc.


QUICK EXERCISE 17.1 Suppose we have a random sample $X_1, X_2$ from a
distribution with variance 1. Compute the variance of $X_1 + X_2$.


Properties that are inherent to the random phenomenon under study may 
provide additional knowledge about the distribution of the sample. Recall 
the software data discussed in Chapter 15. The data are observed lengths in 
CPU seconds between successive failures that occur during the execution of 
a certain real-time command. Typically, in a situation like this, in a small 
time interval, either 0 or 1 failure occurs. Moreover, failures occur with small 
probability and in disjoint time intervals failures occur independent of each 
other. In addition, let us assume that the rate at which the failures occur 
is constant over time. According to Chapter 12, this justifies the choice of 
a Poisson process to model the series of failures. From the properties of the 
Poisson process we know that the interfailure times are independent and have 
the same exponential distribution. Hence we model the software data as the 
realization of a random sample from an exponential distribution. 




In some cases we may not be able to specify the type of distribution. Take, for 
instance, the Old Faithful data consisting of observed durations of eruptions 
of the Old Faithful geyser. Due to lack of specific geological knowledge about 
the subsurface and the mechanism that governs the eruptions, we prefer not to 
assume a particular type of distribution. However, we do model the durations 
as the realization of a random sample from a continuous distribution on $(0, \infty)$.


In each of the three examples the dataset was obtained from repeated mea- 
surements performed under the same experimental conditions. The basic sta- 
tistical model for such a dataset is to consider the measurements as a random 
sample and to interpret the dataset as the realization of the random sample. 
Knowledge about the phenomenon under study and the nature of the experi- 
ment may lead to partial specification of the probability distribution of each 
$X_i$ in the sample. This should be included in the model.


STATISTICAL MODEL FOR REPEATED MEASUREMENTS. A dataset
consisting of values $x_1, x_2, \ldots, x_n$ of repeated measurements of the
same quantity is modeled as the realization of a random sample
$X_1, X_2, \ldots, X_n$. The model may include a partial specification of
the probability distribution of each $X_i$.


The probability distribution of each $X_i$ is called the model distribution.
Usually it refers to a collection of distributions: in the Old Faithful example to
the collection of all continuous distributions on $(0, \infty)$, in the software
example to the collection of all exponential distributions. In the latter case the
parameter of the exponential distribution is called the model parameter. The
unique distribution from which the sample actually originates is assumed to 
be one particular member of this collection and is called the “true” distribu- 
tion. Similarly, in the software example, the parameter corresponding to the 
“true” exponential distribution is called the “true” parameter. The word true 
is put between quotation marks because it does not refer to something in the 
real world, but only to a distribution (or parameter) in the statistical model, 
which is merely an approximation of the real situation. 


QUICK EXERCISE 17.2 We obtain a dataset of ten elements by tossing a coin 
ten times and recording the result of each toss. What is an appropriate sta- 
tistical model and corresponding model distribution for this dataset? 


Of course there are situations where the assumption of independence or identi- 
cal distributions is unrealistic. In that case a different statistical model would 
be more appropriate. However, we will restrict ourselves mainly to the case 
where the dataset can be modeled as the realization of a random sample. 
Once we have formulated a statistical model for our dataset, we can use the 
dataset to infer knowledge about the model distribution. Important questions 
about the corresponding model distribution are 




• which feature of the model distribution represents the quantity of interest
and how do we use our dataset to determine a value for this?


• which model distribution fits a particular dataset best?


These questions can be diverse, and answering them may be difficult. For 
instance, the Old Faithful data are modeled as a realization of a random 
sample from a continuous distribution. Suppose we are interested in a complete 
characterization of the “true” distribution, such as the distribution function 
F or the probability density f. Since there are no further specifications about 
the type of distribution, our problem would be to estimate the complete curve 
of F or f on the basis of our dataset. 

On the other hand, the software data are modeled as the realization of a
random sample from an exponential distribution. In that case $F$ and $f$ are
completely characterized by a single parameter $\lambda$:


\[ F(x) = 1 - e^{-\lambda x} \quad\text{and}\quad f(x) = \lambda e^{-\lambda x} \quad\text{for } x \geq 0. \]


Even if we are interested in the curves of F and f, our problem would reduce 
to estimating a single parameter on the basis of our dataset. 
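
To make concrete how a single parameter fixes both curves, the following Python sketch evaluates the exponential distribution function and density for a given rate. The function names `exp_cdf` and `exp_pdf` and the value 0.5 for the parameter are arbitrary choices for illustration only.

```python
import math


def exp_cdf(x, lam):
    """Distribution function F(x) = 1 - exp(-lam * x) for x >= 0."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0


def exp_pdf(x, lam):
    """Probability density f(x) = lam * exp(-lam * x) for x >= 0."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0


lam = 0.5  # hypothetical parameter value, chosen only for illustration
for x in (0.0, 1.0, 2.0, 4.0):
    print(x, round(exp_cdf(x, lam), 4), round(exp_pdf(x, lam), 4))
```

Once a value for the parameter is available, every point of both curves can be computed, which is why estimating the curves reduces to estimating one number.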

In other cases we may not be interested in the distribution as a whole, but 
only in a specific feature of the model distribution that represents the quantity 
of interest. For instance, in a physical experiment, such as the one performed 
by Michelson, one usually thinks of each measurement as 


measurement = quantity of interest + measurement error. 


The quantity of interest, in this case the speed of light, is thought of as being 
some (unknown) constant and the measurement error is some random fluc- 
tuation. In the absence of systematic error, the measurement error can be 
modeled by a random variable with zero expectation and finite variance. In 
that case the measurements are modeled by a random sample from a distribu- 
tion with some unknown expectation and finite variance. The speed of light is 
represented by the expectation of the model distribution. Our problem would 
be to estimate the expectation of the model distribution on the basis of our 
dataset. 


In the remaining chapters, we will develop several statistical methods to infer 
knowledge about the “true” distribution or about a specific feature of it, by 
means of a dataset. In the remainder of this chapter we will investigate how 
the graphical and numerical summaries of our dataset can serve as a first 
indication of what an appropriate choice would be for this distribution or for 
a specific feature, such as its expectation. 


17.2 Distribution features and sample statistics 


In Chapters 15 and 16 we have discussed several empirical summaries of
datasets. They are examples of numbers, curves, and other objects that are a
function

\[ h(x_1, x_2, \ldots, x_n) \]

of the dataset $x_1, x_2, \ldots, x_n$ only. Since datasets are modeled as realizations
of random samples $X_1, X_2, \ldots, X_n$, an object $h(x_1, x_2, \ldots, x_n)$ is a realization
of the corresponding random object

\[ h(X_1, X_2, \ldots, X_n). \]

Such an object, which depends on the random sample $X_1, X_2, \ldots, X_n$ only, is
called a sample statistic.


If a statistical model adequately describes the dataset at hand, then the sample
statistics corresponding to the empirical summaries should somehow reflect
corresponding features of the model distribution. We have already seen a
mathematical justification for this in Chapter 13 for the sample statistic

\[ \bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n} \]

based on a sample $X_1, X_2, \ldots, X_n$ from a probability distribution with
expectation $\mu$. According to the law of large numbers,

\[ \lim_{n \to \infty} P(|\bar{X}_n - \mu| > \varepsilon) = 0 \]

for every $\varepsilon > 0$. This means that for large sample size $n$, the sample mean
of most realizations of the random sample is close to the expectation of the
corresponding distribution. In fact, all sample statistics discussed in
Chapters 15 and 16 are close to corresponding distribution features. To illustrate
this we generate an artificial dataset from a normal distribution with
parameters $\mu = 5$ and $\sigma = 2$, using a technique similar to the one described
in Section 6.2. Next, we compare the sample statistics with corresponding
features of this distribution.
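
A reader who wants to repeat this experiment can generate comparable artificial datasets as follows. This is a sketch under stated assumptions: it uses Python's built-in normal generator `random.gauss` rather than the technique of Section 6.2, the seed is arbitrary, and the resulting values will therefore differ from the dataset used in this chapter.

```python
import random
import statistics

MU, SIGMA = 5.0, 2.0          # parameters of the N(5, 4) distribution
rng = random.Random(2005)     # arbitrary seed, for reproducibility only


def normal_sample(n):
    """Realization of a random sample of size n from N(MU, SIGMA^2)."""
    return [rng.gauss(MU, SIGMA) for _ in range(n)]


# Sample means for increasing sample sizes.
sample_means = {n: statistics.fmean(normal_sample(n)) for n in (20, 200, 20000)}
for n, mean in sample_means.items():
    print(f"n = {n:5d}: sample mean = {mean:.3f} (expectation {MU})")
```

For the largest sample size the sample mean should lie very close to 5, in line with the law of large numbers.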


The empirical distribution function 


Let $X_1, X_2, \ldots, X_n$ be a random sample from distribution function $F$, and let

\[ F_n(a) = \frac{\text{number of } X_i \text{ in } (-\infty, a]}{n} \]

be the empirical distribution function of the sample. Another application of
the law of large numbers (see Exercise 13.7) yields that for every $\varepsilon > 0$,

\[ \lim_{n \to \infty} P(|F_n(a) - F(a)| > \varepsilon) = 0. \]

This means that for most realizations of the random sample the empirical
distribution function $F_n$ is close to $F$:

\[ F_n(a) \approx F(a). \]
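
The approximation can be examined numerically. In the sketch below, the helper names `ecdf` and `max_gap`, the seed, the evaluation grid, and the sample sizes are our own choices; the normal distribution function is taken from `statistics.NormalDist`.

```python
import bisect
import random
from statistics import NormalDist


def ecdf(sample):
    """Return the empirical distribution function F_n of the sample."""
    xs = sorted(sample)
    n = len(xs)

    def F_n(a):
        # bisect_right counts the elements that are <= a
        return bisect.bisect_right(xs, a) / n

    return F_n


def max_gap(sample, dist, grid):
    """Largest value of |F_n(a) - F(a)| over the points a in grid."""
    F_n = ecdf(sample)
    return max(abs(F_n(a) - dist.cdf(a)) for a in grid)


true_dist = NormalDist(mu=5, sigma=2)      # the N(5, 4) distribution
rng = random.Random(1)                     # arbitrary seed
grid = [-2 + 0.1 * j for j in range(141)]  # points from -2 to 12

gaps = {}
for n in (20, 200, 20000):
    sample = [rng.gauss(5, 2) for _ in range(n)]
    gaps[n] = max_gap(sample, true_dist, grid)
    print(f"n = {n:5d}: largest gap on the grid = {gaps[n]:.3f}")
```

The gap typically shrinks as the sample size grows, although for a single small sample it may be considerable.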


Fig. 17.1. Empirical distribution functions of normal samples.


Hence the empirical distribution function of the normal dataset should
resemble the distribution function

\[ F(a) = \int_{-\infty}^{a} \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-5}{2}\right)^2} dx \]

of the N(5,4) distribution, and the fit should become better as the sample size
$n$ increases. An illustration of this can be found in Figure 17.1. We displayed
the empirical distribution functions of datasets generated from an N(5,4)
distribution together with the “true” distribution function $F$ (dotted lines),
for sample sizes $n = 20$ (left) and $n = 200$ (right).


The histogram and the kernel density estimate 


Suppose the random sample $X_1, X_2, \ldots, X_n$ is generated from a continuous
distribution with probability density $f$. In Section 13.4 we have seen yet
another consequence of the law of large numbers:

\[ \frac{\text{number of } X_i \text{ in } (x - h, x + h]}{2hn} \approx f(x). \]

When $(x - h, x + h]$ is a bin of a histogram of the random sample, this means
that the height of the histogram approximates the value of $f$ at the midpoint
of the bin:

\[ \text{height of the histogram on } (x - h, x + h] \approx f(x). \]

Similarly, the kernel density estimate of a random sample approximates the
corresponding probability density $f$:

\[ f_{n,h}(x) \approx f(x). \]
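
The first approximation is easy to try out on simulated data. The following sketch counts the sample points in $(x - h, x + h]$ and divides by $2hn$; the function name `histogram_height`, the seed, the sample size, and the half-width $h = 0.25$ are arbitrary choices for illustration.

```python
import random
from statistics import NormalDist


def histogram_height(sample, x, h):
    """(number of sample points in (x - h, x + h]) / (2 * h * n)."""
    count = sum(1 for value in sample if x - h < value <= x + h)
    return count / (2 * h * len(sample))


true_dist = NormalDist(mu=5, sigma=2)   # the N(5, 4) distribution
rng = random.Random(3)                  # arbitrary seed
sample = [rng.gauss(5, 2) for _ in range(50000)]

h = 0.25
heights = {x: histogram_height(sample, x, h) for x in (3.0, 5.0, 7.0)}
for x, height in heights.items():
    print(f"x = {x}: height = {height:.4f}, f(x) = {true_dist.pdf(x):.4f}")
```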


Fig. 17.2. Histogram and kernel density estimate of a sample of size 200.


So the histogram and kernel density estimate of the normal dataset should
resemble the graph of the probability density

\[ f(x) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-5}{2}\right)^2} \]

of the N(5,4) distribution. This is illustrated in Figure 17.2, where we
displayed a histogram and a kernel density estimate of our dataset consisting of
200 values generated from the N(5,4) distribution. It should be noted that
with a smaller dataset the similarity can be much worse. This is demonstrated
in Figure 17.3, which is based on the dataset consisting of 20 values generated
from the same distribution.


Fig. 17.3. Histogram and kernel density estimate of a sample of size 20.


Remark 17.1 (About the approximations). Let $H_n$ be the height of
the histogram on the interval $(x - h, x + h]$, which is assumed to be a bin of
the histogram. Direct application of the law of large numbers merely yields
that $H_n$ converges to

\[ \frac{1}{2h} \int_{x-h}^{x+h} f(u)\,du. \]

Only for small $h$ is this close to $f(x)$. However, if we let $h$ tend to 0 as $n$
increases, a variation on the law of large numbers will guarantee that $H_n$
converges to $f(x)$: for every $\varepsilon > 0$,

\[ \lim_{n \to \infty} P(|H_n - f(x)| > \varepsilon) = 0. \]

A possible choice is the optimal bin width mentioned in Remark 15.1.
Similarly, direct application of the law of large numbers yields that a kernel
density estimator with fixed bandwidth $h$ converges to

\[ \int_{-\infty}^{\infty} f(x + hu) K(u)\,du. \]

Once more, only for small $h$ is this close to $f(x)$, provided that $K$ is
symmetric and integrates to one. However, by letting the bandwidth $h$ tend
to 0 as $n$ increases, yet another variation on the law of large numbers will
guarantee that $f_{n,h}(x)$ converges to $f(x)$: for every $\varepsilon > 0$,

\[ \lim_{n \to \infty} P(|f_{n,h}(x) - f(x)| > \varepsilon) = 0. \]

A possible choice is the optimal bandwidth mentioned in Remark 15.2.


The sample mean, the sample median, and empirical quantiles 


As we saw in Section 5.5, the expectation of an $N(\mu, \sigma^2)$ distribution is $\mu$;
so the N(5,4) distribution has expectation 5. According to the law of large
numbers: $\bar{X}_n \approx \mu$. This is illustrated by our dataset of 200 values generated
from the N(5,4) distribution for which we find

\[ \bar{x}_{200} = 5.012. \]

For the sample median we find

\[ \mathrm{Med}(x_1, \ldots, x_{200}) = 5.018. \]

This illustrates the fact that the sample median of a random sample from
$F$ approximates the median $q_{0.5} = F^{\mathrm{inv}}(0.5)$. In fact, we have the following
general property for the $p$th empirical quantile:

\[ q_n(p) \approx F^{\mathrm{inv}}(p) = q_p. \]

In the special case of the $N(\mu, \sigma^2)$ distribution, the expectation and the
median coincide, which explains why the sample mean and sample median of the
normal dataset are so close to each other.




The sample variance and standard deviation, and the MAD 


As we saw in Section 5.5, the standard deviation and variance of an $N(\mu, \sigma^2)$
distribution are $\sigma$ and $\sigma^2$; so for the N(5,4) distribution these are 2 and 4.
Another consequence of the law of large numbers is that

\[ S_n^2 \approx \sigma^2 \quad\text{and}\quad S_n \approx \sigma. \]

This is illustrated by our normal dataset of size 200, for which we find

\[ s_{200}^2 = 4.761 \quad\text{and}\quad s_{200} = 2.182 \]

for the sample variance and sample standard deviation.


For the MAD of the dataset we find 1.334, which clearly differs from the
standard deviation 2 of the N(5,4) distribution. The reason is that

\[ \mathrm{MAD}(X_1, X_2, \ldots, X_n) \approx F^{\mathrm{inv}}(0.75) - F^{\mathrm{inv}}(0.5), \]

for any distribution that is symmetric around its median $F^{\mathrm{inv}}(0.5)$. For the
N(5,4) distribution $F^{\mathrm{inv}}(0.75) - F^{\mathrm{inv}}(0.5) = 2\Phi^{\mathrm{inv}}(0.75) = 1.3490$, where
$\Phi$ denotes the distribution function of the standard normal distribution (see
Exercise 17.10).
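
The value 1.3490 and the behavior of the MAD can be checked numerically. The sketch below is only an illustration: it relies on `statistics.NormalDist.inv_cdf` for the normal quantile function, the helper `mad`, the seed, and the sample size are our own choices, and the simulated sample is not the dataset used in the text.

```python
import random
import statistics
from statistics import NormalDist


def mad(data):
    """Median of absolute deviations from the sample median."""
    med = statistics.median(data)
    return statistics.median([abs(x - med) for x in data])


model = NormalDist(mu=5, sigma=2)                   # the N(5, 4) distribution
target = model.inv_cdf(0.75) - model.inv_cdf(0.5)   # F^inv(0.75) - F^inv(0.5)
via_phi = 2 * NormalDist().inv_cdf(0.75)            # 2 * Phi^inv(0.75)

rng = random.Random(17)                             # arbitrary seed
sample = [rng.gauss(5, 2) for _ in range(20000)]
sample_mad = mad(sample)
sample_sd = statistics.stdev(sample)

print(round(target, 4), round(via_phi, 4))
print(round(sample_mad, 3), round(sample_sd, 3))
```

The sample standard deviation settles near 2, whereas the MAD settles near 1.349: the two statistics approximate different features of the distribution.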


Relative frequencies 


For continuous distributions the histogram and kernel density estimates of a
random sample approximate the corresponding probability density $f$. For
discrete distributions we would like to have a sample statistic that approximates
the probability mass function. In Section 13.4 we saw that, as a consequence
of the law of large numbers, relative frequencies based on a random sample
approximate corresponding probabilities. As a special case, for a random sample
$X_1, X_2, \ldots, X_n$ from a discrete distribution with probability mass function $p$,
one has that

\[ \frac{\text{number of } X_i \text{ equal to } a}{n} \approx p(a). \]

This means that the relative frequency of $a$'s in the sample approximates
the value of the probability mass function at $a$. Table 17.2 lists the sample
statistics and the corresponding distribution features they approximate.
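
The statement about relative frequencies can be illustrated with a small simulation. The probability mass function below, on the values 0, 1, and 2, is hypothetical and chosen only for this example, as are the seed and the sample size.

```python
import random
from collections import Counter

# Hypothetical probability mass function, for illustration only.
pmf = {0: 0.5, 1: 0.3, 2: 0.2}

rng = random.Random(42)   # arbitrary seed
n = 50000
sample = rng.choices(list(pmf), weights=list(pmf.values()), k=n)

counts = Counter(sample)
relative_frequencies = {a: counts[a] / n for a in pmf}

for a in pmf:
    print(f"a = {a}: relative frequency = {relative_frequencies[a]:.4f}, "
          f"p(a) = {pmf[a]}")
```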


17.3 Estimating features of the “true” distribution 


In the previous section we generated a dataset of 200 elements from a proba- 
bility distribution, and we have seen that certain features of this distribution 
are approximated by corresponding sample statistics. In practice, the situa- 
tion is reversed. In that case we have a dataset of n elements that is modeled 
as the realization of a random sample with a probability distribution that is 
unknown to us. Our goal is to use our dataset to estimate a certain feature 
of this distribution that represents the quantity of interest. In this section we 
will discuss a few examples. 




Table 17.2. Some sample statistics and corresponding distribution features.


Sample statistic                                        Distribution feature

Graphical
Empirical distribution function $F_n$                   Distribution function $F$
Kernel density estimate $f_{n,h}$ and histogram         Probability density $f$
(Number of $X_i$ equal to $a$)$/n$                      Probability mass function $p(a)$

Numerical
Sample mean $\bar{X}_n$                                 Expectation $\mu$
Sample median $\mathrm{Med}(X_1, X_2, \ldots, X_n)$     Median $q_{0.5} = F^{\mathrm{inv}}(0.5)$
$p$th empirical quantile $q_n(p)$                       100$p$th percentile $q_p = F^{\mathrm{inv}}(p)$
Sample variance $S_n^2$                                 Variance $\sigma^2$
Sample standard deviation $S_n$                         Standard deviation $\sigma$
$\mathrm{MAD}(X_1, X_2, \ldots, X_n)$                   $F^{\mathrm{inv}}(0.75) - F^{\mathrm{inv}}(0.5)$, for symmetric $F$


The Old Faithful data 


We stick to the assumptions of Section 17.1: for lack of knowledge about this
phenomenon we prefer not to specify a particular parametric type of distribution,
and we model the Old Faithful data as the realization of a random sample of
size 272 from a continuous probability distribution. From the previous section
we know that the kernel density estimate and the empirical distribution
function of the dataset approximate the probability density $f$ and the distribution
function $F$ of this distribution. In Figure 17.4 a kernel density estimate (left)
and the empirical distribution function (right) are displayed. Indeed, neither
graph resembles the probability density function or distribution function of
any of the familiar parametric distributions. Instead of viewing both graphs
only as graphical summaries of the data, we can also use both curves as
estimates for $f$ and $F$. We estimate the model probability density $f$ by means of
the kernel density estimate and the model distribution function $F$ by means
of the empirical distribution function. Since neither estimate assumes a
particular parametric model, they are called nonparametric estimates.


Fig. 17.4. Nonparametric estimates for $f$ and $F$ based on the Old Faithful data.


The software data 


Next consider the software reliability data. As motivated in Section 17.1, 
we model interfailure times as the realization of a random sample from an 
exponential distribution. To see whether an exponential distribution is indeed 
a reasonable model, we plot a histogram and a kernel density estimate using 
a boundary kernel in Figure 17.5. 


Fig. 17.5. Histogram and kernel density estimate for the software data.


Both seem to corroborate the assumption of an exponential distribution.
Accepting this, we are left with estimating the parameter $\lambda$. Because for the
exponential distribution $\mathrm{E}[X] = 1/\lambda$, the law of large numbers suggests $1/\bar{x}_n$
as an estimate for $\lambda$. For our dataset $\bar{x}_n = 656.88$, which yields $1/\bar{x}_n = 0.0015$.
In Figure 17.6 we compare the estimated exponential density (left) and dis- 
tribution function (right) with the corresponding nonparametric estimates. 
Note that the nonparametric estimates do not assume an exponential model 
for the data. But, if an exponential distribution were the right model, the 
kernel density estimate and empirical distribution function should resemble 
the estimated exponential density and distribution function. At first sight the 
fit seems reasonable, although near zero the data accumulate more than one 
might perhaps expect for a sample of size 135 from an exponential distri- 
bution, and the other way around at the other end of the data range. The 
question is whether this phenomenon can be attributed to chance or is caused 
by the fact that the exponential model is the wrong model. We will return to 
this type of question in Chapter 25 (see also Chapter 18). 
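
The estimate for the parameter, and the estimated exponential curves that follow from it, can be computed directly from the reported sample mean. The sketch below uses only the value 656.88 quoted above, since the software data themselves are not reproduced in this section; the function names are our own.

```python
import math

sample_mean = 656.88            # sample mean of the software data (from the text)
lam_hat = 1.0 / sample_mean     # estimate for the parameter of the exponential model


def estimated_cdf(x):
    """Estimated distribution function 1 - exp(-lam_hat * x) for x >= 0."""
    return 1.0 - math.exp(-lam_hat * x) if x >= 0 else 0.0


def estimated_pdf(x):
    """Estimated density lam_hat * exp(-lam_hat * x) for x >= 0."""
    return lam_hat * math.exp(-lam_hat * x) if x >= 0 else 0.0


estimated_median = math.log(2) / lam_hat   # solves estimated_cdf(x) = 0.5

print(round(lam_hat, 4))
print(round(estimated_median, 1))
for x in (0, 2000, 4000):
    print(x, round(estimated_cdf(x), 3), round(estimated_pdf(x), 6))
```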


Fig. 17.6. Kernel density estimate and empirical cdf for software data (solid)
compared to $f$ and $F$ of the estimated exponential distribution.


Michelson data 


Consider the Michelson data on the speed of light. In this case we are not 
particularly interested in estimation of the “true” distribution, but solely in 
the expectation of this distribution, which represents the speed of light. The
law of large numbers suggests estimating the expectation by the sample
mean $\bar{x}_n$, which equals 852.4.


17.4 The linear regression model 


Recall the example about predicting Janka hardness of wood from the density 
of the wood in Section 15.5. The idea is, of course, that Janka hardness is 
related to the density: the higher the density of the wood, the higher the 
value of Janka hardness. This suggests a relationship of the type 


hardness = g(density of timber) 


for some increasing function g. This is supported by the scatterplot of the data 
in Figure 17.7. A closer look at the bivariate dataset in Table 15.5 suggests 
that randomness is also involved. For instance, for the value 51.5 of the density, 
different corresponding values of Janka hardness were observed. One way to 
model such a situation is by means of a regression model: 


hardness = g(density of timber) + random fluctuation. 


The important question now is what sort of function g fits well to the points 
in the scatterplot? 

In general, this may be a difficult question to answer. We may have so little 
knowledge about the phenomenon under study, and the data points may be 




Fig. 17.7. Scatterplot of Janka hardness versus wood density. 


scattered in such a way, that there is no reason to assume a specific type of 
function for g. However, for the Janka hardness data it makes sense to assume 
that g is increasing, but this still leaves us with many possibilities. Looking at 
the scatterplot, at first sight it does not seem unreasonable to assume that g is 
a straight line, i.e., Janka hardness depends linearly on the density of timber. 
The fact that the points are not exactly on a straight line is then modeled by 
a random fluctuation with respect to the straight line: 


hardness = $\alpha$ + $\beta$ · (density of timber) + random fluctuation.


This is a loose description of a simple linear regression model. A more complete 
description is given below. 


SIMPLE LINEAR REGRESSION MODEL. In a simple linear regression
model for a bivariate dataset $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, we
assume that $x_1, x_2, \ldots, x_n$ are nonrandom and that
$y_1, y_2, \ldots, y_n$ are realizations of random variables
$Y_1, Y_2, \ldots, Y_n$ satisfying

$$Y_i = \alpha + \beta x_i + U_i \quad \text{for } i = 1, 2, \ldots, n,$$

where $U_1, \ldots, U_n$ are independent random variables with
$\mathrm{E}[U_i] = 0$ and $\mathrm{Var}(U_i) = \sigma^2$.


The line $y = \alpha + \beta x$ is called the regression line. The parameters
$\alpha$ and $\beta$ represent the intercept and slope of the regression line.
Usually, the $x$-variable is called the explanatory variable and the
$y$-variable is called the response variable. One also refers to $x$ and $y$ as
independent and dependent variables. The random variables
$U_1, U_2, \ldots, U_n$ are assumed to be independent when the
different measurements do not influence each other. They are assumed to have




expectation zero, because the random fluctuation is considered to be around
the regression line $y = \alpha + \beta x$. Finally, because each random
fluctuation is supposed to have the same amount of variability, we assume that
all $U_i$ have the same variance. Note that by the propagation of independence
rule in Section 9.4, independence of the $U_i$ implies independence of the
$Y_i$. However, $Y_1, Y_2, \ldots, Y_n$ do not form a random sample. Indeed,
the $Y_i$ have different distributions because every $Y_i$ has a different
expectation

$$\mathrm{E}[Y_i] = \mathrm{E}[\alpha + \beta x_i + U_i] = \alpha + \beta x_i + \mathrm{E}[U_i] = \alpha + \beta x_i.$$


QUICK EXERCISE 17.3 Consider the simple linear regression model as defined 
earlier. Compute the variance of $Y_i$.


The parameters $\alpha$ and $\beta$ are unknown and our task will be to estimate them on
the basis of the data. We will come back to this in Chapter 22. In Figure 17.8 
the scatterplot for the Janka hardness data is displayed with the estimated 




Fig. 17.8. Estimated regression line for the Janka hardness data. 


regression line

$$y = -1160.5 + 57.51x.$$

Taking a closer look at Figure 17.8, you might wonder whether

$$y = \alpha + \beta x + \gamma x^2$$


would be a more appropriate model. By trying to answer this question we 
enter the area of multiple linear regression. We will not pursue this topic; we 
restrict ourselves to simple linear regression. 
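
To make the model assumptions concrete, the following Python sketch simulates
responses from a simple linear regression model. It is only an illustration:
the intercept and slope are the estimated values quoted above, while the value
of $\sigma$, the normal distribution chosen for the $U_i$, the three densities,
and all function names are assumptions that do not come from the text.

```python
import random
import statistics


def simulate_responses(xs, alpha, beta, sigma, rng):
    """Draw one realization y_i = alpha + beta * x_i + u_i with independent u_i.

    The u_i are taken to be normal here; the model itself only requires
    E[U_i] = 0 and Var(U_i) = sigma^2.
    """
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]


rng = random.Random(2)
alpha, beta = -1160.5, 57.51   # estimated regression line quoted in the text
sigma = 180.0                  # hypothetical value, not taken from the text
xs = [30.0, 50.0, 70.0]        # a few illustrative wood densities

# Repeat the experiment many times to look at the distribution of each Y_i.
runs = [simulate_responses(xs, alpha, beta, sigma, rng) for _ in range(5000)]
for i, x in enumerate(xs):
    ys = [run[i] for run in runs]
    # Columns: x_i, simulated mean of Y_i, simulated standard deviation of Y_i.
    print(x, round(statistics.mean(ys), 1), round(statistics.pstdev(ys), 1))
```

The simulated means differ from one $x_i$ to the next, while the spread around
the line stays roughly the same. This is the sense in which the $Y_i$ are
independent but do not form a random sample.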




17.5 Solutions to the quick exercises 


17.1 Because $X_1, X_2$ form a random sample, they are independent. Using
the rule about the variance of the sum of independent random variables, this
means that $\mathrm{Var}(X_1 + X_2) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) = 1 + 1 = 2$.


17.2 The result of each toss of a coin can be modeled by a Bernoulli random 
variable taking values 1 (heads) and 0 (tails). In the case when it is known 
that we are tossing a fair coin, heads and tails occur with equal probability. 
Since it is reasonable to assume that the tosses do not influence each other, 
the outcomes of the ten tosses are modeled as the realization of a random 
sample $X_1, \ldots, X_{10}$ from a Bernoulli distribution with parameter
$p = 1/2$. In this case the model distribution is completely specified and
coincides with the “true” distribution: a $Ber(1/2)$ distribution.

In the case when we are dealing with a possibly unfair coin, the outcomes 
of the ten tosses are still modeled as the realization of a random sample 
$X_1, \ldots, X_{10}$ from a Bernoulli distribution, but we cannot specify the value
of the parameter p. The model distribution is a Bernoulli distribution. The 
“true” distribution is a Bernoulli distribution with one particular value for p, 
unknown to us. 


17.3 Note that the $x_i$ are considered nonrandom. By the rules for the
variance, we find
$\mathrm{Var}(Y_i) = \mathrm{Var}(\alpha + \beta x_i + U_i) = \mathrm{Var}(U_i) = \sigma^2$.


17.6 Exercises 


17.1 Figure 17.9 displays several histograms, kernel density estimates, and
empirical distribution functions. It is known that all figures correspond to
datasets of size 200 that are generated from normal distributions $N(0, 1)$,
$N(0, 9)$, and $N(3, 1)$, and from exponential distributions $Exp(1)$ and
$Exp(1/3)$. Report for each figure from which distribution the dataset has been
generated.


17.2 Figure 17.10 displays several boxplots. It is known that all figures
correspond to datasets of size 200 that are generated from the same five dis- 
tributions as in Exercise 17.1. Report for each boxplot from which distribution 
the dataset has been generated. 


17.3 At a London underground station, the number of women was counted
in each of 100 queues of length 10. In this way a dataset
$x_1, x_2, \ldots, x_{100}$ was obtained, where $x_i$ denotes the observed
number of women in the $i$th queue. The dataset is summarized in the following
table and lists the number of queues with 0 women, 1 woman, 2 women, etc.


[Figure 17.9: fifteen panels, labeled Dataset 1 through Dataset 15, each
showing a histogram, a kernel density estimate, or an empirical distribution
function.]


Fig. 17.9. Graphical representations of different datasets from Exercise 17.1. 


[Figure 17.10: fifteen boxplots, labeled Boxplot 1 through Boxplot 15.]



Fig. 17.10. Boxplot of different datasets from Exercise 17.2. 




Count       0   1   2   3   4   5   6   7   8   9  10
Frequency   1   3   4  23  25  19  18   5   1   1   0


Source: R.A. Jinkinson and M. Slater. Critical discussion of a graphical 
method for identifying discrete distributions. The Statistician, 30:239-248, 
1981; Table 1 on page 240. 


In the statistical model for this dataset, we assume that the observed counts 
are a realization of a random sample $X_1, X_2, \ldots, X_{100}$.


a. Assume that people line up in such a way that a man or woman in a 
certain position is independent of the other positions, and that in each 
position one has a woman with equal probability. What is an appropriate 
choice for the model distribution? 


b. Use the table to find an estimate for the parameter(s) of the model dis- 
tribution chosen in part a. 


17.4 During the Second World War, London was hit by numerous flying 
bombs. The following data are from an area in South London of 36 square 
kilometers. The area was divided into 576 squares with sides of length 1/4 
kilometer. For each of the 576 squares the number of hits was recorded. In 
this way we obtain a dataset $x_1, x_2, \ldots, x_{576}$, where $x_i$ denotes the number of
hits in the $i$th square. The data are summarized in the following table which
lists the number of squares with no hits, 1 hit, 2 hits, etc. 


Number of hits        0    1    2    3    4    5    6    7
Number of squares   229  211   93   35    7    0    0    1

Source: R.D. Clarke. An application of the Poisson distribution. Journal of
the Institute of Actuaries, 72:48, 1946; Table 1 on page 481. © Faculty and
Institute of Actuaries.

An interesting question is whether London was hit in a completely random 
manner. In that case a Poisson distribution should fit the data. 


a. If we model the dataset as the realization of a random sample from a
Poisson distribution with parameter $\mu$, then what would you choose as an
estimate for $\mu$?

b. Check the fit with a Poisson distribution by comparing some of the
observed relative frequencies of 0’s, 1’s, 2’s, etc., with the corresponding
probabilities for the Poisson distribution with $\mu$ estimated as in part a.


17.5 We return to the example concerning the number of menstrual cycles
up to pregnancy, where the number of cycles was modeled by a geometric 
random variable (see Section 4.4). The original data concerned 100 smoking 
and 486 nonsmoking women. For 7 smokers and 12 nonsmokers, the exact 
number of cycles up to pregnancy was unknown. In the following tables we only 




incorporated the 93 smokers and 474 nonsmokers, for which the exact number 
of cycles was observed. Another analysis, based on the complete dataset, is 
done in Section 21.1. 


a. Consider the dataset $x_1, x_2, \ldots, x_{93}$ corresponding to the smoking
women, where $x_i$ denotes the number of cycles for the $i$th smoking woman.
The data are summarized in the following table.


Cycles       1   2   3   4   5   6   7   8   9  10  11  12
Frequency   29  16  17   4   3   9   4   5   1   1   1   3

Source: C.R. Weinberg and B.C. Gladen. The beta-geometric distribution
applied to comparative fecundability studies. Biometrics, 42(3):547-560, 1986.


The table lists the number of women that had to wait 1 cycle, 2 cycles, 
etc. If we model the dataset as the realization of a random sample from a 
geometric distribution with parameter p, then what would you choose as 
an estimate for p? 

b. Also estimate the parameter $p$ for the 474 nonsmoking women, which
is also modeled as the realization of a random sample from a geometric
distribution. The dataset $y_1, y_2, \ldots, y_{474}$, where $y_j$ denotes the
number of cycles for the $j$th nonsmoking woman, is summarized here:


Cycles        1    2    3    4    5    6    7    8    9   10   11   12
Frequency   198  107   55   38   18   22    7    9    5    3    6    6


Source: C.R. Weinberg and B.C. Gladen. The beta-geometric distribution
applied to comparative fecundability studies. Biometrics, 42(3):547-560, 1986.


You may use that $y_1 + y_2 + \cdots + y_{474} = 1285$.


c. Compare the estimates of the probability of becoming pregnant in three 
or fewer cycles for smoking and nonsmoking women. 


17.6 Recall Exercise 15.1 about the chest circumference of 5732 Scottish sol- 
diers, where we constructed the histogram displayed in Figure 17.11. The 
histogram suggests modeling the data as the realization of a random sample 
from a normal distribution. 


a. Suppose that for the dataset $\sum x_i = 228377.2$ and
$\sum x_i^2 = 9124064$. What would you choose as estimates for the parameters
$\mu$ and $\sigma$ of the $N(\mu, \sigma^2)$ distribution?
Hint: you may want to use the relation from Exercise 16.15.

b. Give an estimate for the probability that a Scottish soldier has a chest
circumference between 38.5 and 42.5 inches.




Fig. 17.11. Histogram of chest circumferences. 


17.7 Recall Exercise 15.3 about time intervals between successive coal mine
disasters. Let us assume that the rate at which the disasters occur is constant
over time and that on a single day a disaster takes place with small probability
independently of what happens on other days. According to Chapter 12 this
suggests modeling the series of disasters with a Poisson process. Figure 17.12
displays a histogram and empirical distribution function of the observed time
intervals.


a. In the statistical model for this dataset we model the 190 time intervals
as the realization of a random sample. What would you choose for the
model distribution?

b. The sum of the observed time intervals is 40549 days. Give an estimate
for the parameter(s) of the distribution chosen in part a.




Fig. 17.12. Histogram of time intervals between successive disasters. 




17.8 The following data represent the number of revolutions to failure (in 
millions) of 22 deep-groove ball-bearings. 


17.88 28.92 33.00 41.52 42.12 

45.60 48.48 51.84 51.96 54.12 

55.56 67.80 68.64 68.88 84.12 

93.12 98.64 105.12 105.84 127.92 
128.04 173.40 


Source: J. Lieblein and M. Zelen. Statistical investigation of the fatigue-life 
of deep-groove ball-bearings. Journal of Research, National Bureau of Stan- 
dards, 57:273-316, 1956; specimen worksheet on page 286. 


Lieblein and Zelen propose modeling the dataset as a realization of a random 
sample from a Weibull distribution, which has distribution function 


$$F(x) = 1 - e^{-(\lambda x)^\alpha} \quad \text{for } x \geq 0,$$

and $F(x) = 0$ for $x < 0$, where $\alpha, \lambda > 0$.


a. Suppose that $X$ is a random variable with a Weibull distribution. Check
that the random variable $Y = X^\alpha$ has an exponential distribution with
parameter $\lambda^\alpha$ and conclude that
$\mathrm{E}[X^\alpha] = 1/\lambda^\alpha$.


b. Use part a to explain how one can use the data in the table to find
an estimate for the parameter $\lambda$, if it is given that the parameter
$\alpha$ is estimated by 2.102.


17.9 The volume (i.e., the effective wood production in cubic meters),
height (in meters), and diameter (in meters) (measured at 1.37 meter above 
the ground) are recorded for 31 black cherry trees in the Allegheny National 
Forest in Pennsylvania. The data are listed in Table 17.3. They were collected 
to find an estimate for the volume of a tree (and therefore for the timber 
yield), given its height and diameter. For each tree the volume y and the 
value of $x = d^2 h$ are recorded, where $d$ and $h$ are the diameter and height
of the tree. The resulting points $(x_1, y_1), \ldots, (x_{31}, y_{31})$ are displayed in the
scatterplot in Figure 17.13. 


We model the data by the following linear regression model (without intercept)

$$Y_i = \beta x_i + U_i$$

for $i = 1, 2, \ldots, 31$.


a. What physical reasons justify the linear relationship between $y$ and $d^2 h$?
Hint: how does the volume of a cylinder relate to its diameter and height? 


b. We want to find an estimate for the slope $\beta$ of the line $y = \beta x$.
Two natural candidates are the average slope $\bar{z}_n$, where
$z_i = y_i/x_i$, and the




Table 17.3. Measurements on black cherry trees. 


Diameter Height Volume 


0.21 21.3 0.29 
0.22 19.8 0.29 
0.22 19.2 0.29 
0.27 21.9 0.46 
0.27 24.7 0.53 
0.27 25.3 0.56 
0.28 20.1 0.44 
0.28 22.9 0.52 
0.28 24.4 0.64 
0.28 22.9 0.56 
0.29 24.1 0.69 
0.29 23.2 0.59 
0.29 23.2 0.61 
0.30 21.0 0.60 
0.30 22.9 0.54 
0.33 22.6 0.63 
0.33 25.9 0.96 
0.34 26.2 0.78 
0.35 21.6 0.73 
0.35 19.5 0.71 
0.36 23.8 0.98 
0.36 24.4 0.90 
0.37 22.6 1.03 
0.41 21.9 1.08 
0.41 23.5 1.21 
0.44 24.7 1.57 
0.44 25.0 1.58 
0.45 24.4 1.65 
0.46 24.4 1.46 
0.46 24.4 1.44 
0.52 26.5 2.18 


Source: A.C. Atkinson. Regression diagnostics, trend formations and con- 
structed variables (with discussion). Journal of the Royal Statistical Society, 
Series B, 44:1—36, 1982. 


slope of the averages $\bar{y}_n/\bar{x}_n$. In Chapter 22 we will encounter
the so-called least squares estimate:

$$\frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}.$$




Fig. 17.13. Scatterplot of the black cherry tree data. 


Compute all three estimates for the data in Table 17.3. You need at least
5 digits accuracy, and you may use that $\sum x_i = 87.456$,
$\sum y_i = 26.486$, $\sum y_i/x_i = 9.369$, $\sum x_i y_i = 95.498$, and
$\sum x_i^2 = 314.644$.


17.10 Let $X$ be a random variable with (continuous) distribution function $F$.
Let $m = q_{0.5} = F^{\mathrm{inv}}(0.5)$ be the median of $F$ and define the
random variable

$$Y = |X - m|.$$

a. Show that $Y$ has distribution function $G$, defined by

$$G(y) = F(m + y) - F(m - y).$$


b. The MAD of $F$ is the median of $G$. Show that if the density $f$
corresponding to $F$ is symmetric around its median $m$, then

$$G(y) = 2F(m + y) - 1$$

and derive that

$$G^{\mathrm{inv}}(1/2) = F^{\mathrm{inv}}(3/4) - F^{\mathrm{inv}}(1/2).$$


c. Use b to conclude that the MAD of an $N(\mu, \sigma^2)$ distribution is
equal to $\sigma \Phi^{\mathrm{inv}}(3/4)$, where $\Phi$ is the distribution
function of a standard normal distribution. Recall that the distribution
function $F$ of an $N(\mu, \sigma^2)$ can be written as

$$F(x) = \Phi\left(\frac{x - \mu}{\sigma}\right).$$

You might check that, as stated in Section 17.2, the MAD of the $N(5, 4)$
distribution is equal to $2\Phi^{\mathrm{inv}}(3/4) = 1.3490$.




17.11 In this exercise we compute the MAD of the $Exp(\lambda)$ distribution.


a. Let $X$ have an $Exp(\lambda)$ distribution, with median
$m = (\ln 2)/\lambda$. Show that $Y = |X - m|$ has distribution function

$$G(y) = \tfrac{1}{2}\left(e^{\lambda y} - e^{-\lambda y}\right).$$


b. Argue that the MAD of the $Exp(\lambda)$ distribution is a solution of the
equation $e^{2\lambda y} - e^{\lambda y} - 1 = 0$.


c. Compute the MAD of the $Exp(\lambda)$ distribution.
Hint: put $x = e^{\lambda y}$ and first solve for $x$.


18 


The bootstrap 


In the forthcoming chapters we will develop statistical methods to infer knowl- 
edge about the model distribution and encounter several sample statistics to 
do this. In the previous chapter we have seen examples of sample statistics 
that can be used to estimate different model features, for instance, the
empirical distribution function to estimate the model distribution function
$F$, and the sample mean to estimate the expectation $\mu$ corresponding to
$F$. One of the things we would like to know is how close a sample statistic is
to the model feature it is supposed to estimate. For instance, what is the
probability that the sample mean and $\mu$ differ more than a given tolerance
$\varepsilon$? For this we need to know the distribution of
$\bar{X}_n - \mu$. More generally, it is important
to know how a sample statistic is distributed in relation to the corresponding 
model feature. For the distribution of the sample mean we saw a normal limit 
approximation in Chapter 14. In this chapter we discuss a simulation proce- 
dure that approximates the distribution of the sample mean for finite sample 
size. Moreover, the method is more generally applicable to sample statistics 
other than the sample mean. 


18.1 The bootstrap principle 


Consider the Old Faithful data introduced in Chapter 15, which we modeled
as the realization of a random sample of size $n = 272$ from some distribution
function $F$. The sample mean $\bar{x}_n$ of the observed durations equals
209.3. What does this say about the expectation $\mu$ of $F$? As we saw in
Chapter 17, the value 209.3 is a natural estimate for $\mu$, but to conclude
that $\mu$ is equal to 209.3 is unwise. The reason is that, if we were to
observe a new dataset of durations, we would obtain a different sample mean as
an estimate for $\mu$. This should not come as a surprise. Since the dataset
$x_1, x_2, \ldots, x_n$ is just one possible realization of the random sample
$X_1, X_2, \ldots, X_n$, the observed sample mean is just one possible
realization of the random variable

$$\bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$


A new dataset is another realization of the random sample, and the
corresponding sample mean is another realization of the random variable
$\bar{X}_n$. Hence, to infer something about $\mu$, one should take into
account how realizations of $\bar{X}_n$ vary. This variation is described by
the probability distribution of $\bar{X}_n$.

In principle¹ it is possible to determine the distribution function of
$\bar{X}_n$ from the distribution function $F$ of the random sample
$X_1, X_2, \ldots, X_n$. However, $F$ is unknown. Nevertheless, in Chapter 17
we saw that the observed dataset reflects most features of the “true”
probability distribution. Hence the natural thing to do is to compute an
estimate $\hat{F}$ for the distribution function $F$ and then to consider a
random sample from $\hat{F}$ and the corresponding sample mean as substitutes
for the random sample $X_1, X_2, \ldots, X_n$ from $F$ and the random
variable $\bar{X}_n$. A random sample from $\hat{F}$ is called a bootstrap
random sample, or briefly bootstrap sample, and is denoted by


$$X_1^*, X_2^*, \ldots, X_n^*$$


to distinguish it from the random sample $X_1, X_2, \ldots, X_n$ from the
“true” $F$. The corresponding average is called the bootstrapped sample mean,
and this random variable is denoted by

$$\bar{X}_n^* = \frac{X_1^* + X_2^* + \cdots + X_n^*}{n}$$

to distinguish it from the random variable $\bar{X}_n$. The idea is now to use
the distribution of $\bar{X}_n^*$ to approximate the distribution of
$\bar{X}_n$.


The preceding procedure is called the bootstrap principle for the sample mean.
Clearly, it can be applied to any sample statistic
$h(X_1, X_2, \ldots, X_n)$ by approximating its probability distribution by
that of the corresponding bootstrapped sample statistic
$h(X_1^*, X_2^*, \ldots, X_n^*)$.


BOOTSTRAP PRINCIPLE. Use the dataset $x_1, x_2, \ldots, x_n$ to compute
an estimate $\hat{F}$ for the “true” distribution function $F$. Replace
the random sample $X_1, X_2, \ldots, X_n$ from $F$ by a random sample
$X_1^*, X_2^*, \ldots, X_n^*$ from $\hat{F}$, and approximate the probability
distribution of $h(X_1, X_2, \ldots, X_n)$ by that of
$h(X_1^*, X_2^*, \ldots, X_n^*)$.


¹ In Section 11.1 we saw how the distribution of the sum of independent random
variables can be computed. Together with the change-of-units rule (see page 106),
the distribution of $\bar{X}_n$ can be determined. See also Section 13.1, where
this is done for independent $Gam(2, 1)$ variables.


Returning to the sample mean, the first question that comes to mind is, of
course, how well does the distribution of $\bar{X}_n^*$ approximate the
distribution of $\bar{X}_n$? Or more generally, how well does the distribution
of a bootstrapped sample statistic $h(X_1^*, X_2^*, \ldots, X_n^*)$ approximate
the distribution of the sample statistic of interest
$h(X_1, X_2, \ldots, X_n)$? Applied in such a straightforward manner, the
bootstrap approximation for the distribution of $\bar{X}_n$ by that of
$\bar{X}_n^*$ may not be so good (see Remark 18.1). The bootstrap approximation
will improve if we approximate the distribution of the centered sample mean:


$$\bar{X}_n - \mu,$$

where $\mu$ is the expectation corresponding to $F$. The bootstrapped version
would be the random variable

$$\bar{X}_n^* - \mu^*,$$

where $\mu^*$ is the expectation corresponding to $\hat{F}$. Often the
bootstrap approximation of the distribution of a sample statistic will improve
if we somehow normalize the sample statistic by relating it to a corresponding
feature of the “true” distribution. An example is the centered sample median

$$\mathrm{Med}(X_1, X_2, \ldots, X_n) - F^{\mathrm{inv}}(0.5),$$

where we subtract the median $F^{\mathrm{inv}}(0.5)$ of $F$. Another example is
the normalized sample variance

$$\frac{S_n^2}{\sigma^2},$$

where we divide by the variance $\sigma^2$ of $F$.


QUICK EXERCISE 18.1 Describe how the bootstrap principle should be applied
to approximate the distribution of
$\mathrm{Med}(X_1, X_2, \ldots, X_n) - F^{\mathrm{inv}}(0.5)$.


Remark 18.1 (The bootstrap for the sample mean). To see why
the bootstrap approximation for $\bar{X}_n$ may be bad, consider a dataset
$x_1, x_2, \ldots, x_n$ that is a realization of a random sample
$X_1, X_2, \ldots, X_n$ from an $N(\mu, 1)$ distribution. In that case the
corresponding sample mean $\bar{X}_n$ has an $N(\mu, 1/n)$ distribution. We
estimate $\mu$ by $\bar{x}_n$ and replace the random sample from an
$N(\mu, 1)$ distribution by a bootstrap random sample
$X_1^*, X_2^*, \ldots, X_n^*$ from an $N(\bar{x}_n, 1)$ distribution. The
corresponding bootstrapped sample mean $\bar{X}_n^*$ has an
$N(\bar{x}_n, 1/n)$ distribution. Therefore the distribution functions $G_n$
and $G_n^*$ of the random variables $\bar{X}_n$ and $\bar{X}_n^*$ can be
determined:

$$G_n(a) = \Phi\left(\sqrt{n}\,(a - \mu)\right) \quad \text{and} \quad G_n^*(a) = \Phi\left(\sqrt{n}\,(a - \bar{x}_n)\right).$$

In this case it turns out that the maximum distance between the two
distribution functions is equal to

$$2\Phi\left(\tfrac{1}{2}\sqrt{n}\,|\bar{x}_n - \mu|\right) - 1.$$

Since $\bar{X}_n$ has an $N(\mu, 1/n)$ distribution, this value is
approximately equal to $2\Phi(|z|/2) - 1$, where $z$ is a realization of an
$N(0, 1)$ random variable $Z$. This only equals zero for $z = 0$, so that the
distance between the distribution functions of $\bar{X}_n$ and $\bar{X}_n^*$
will almost always be strictly positive, even for large $n$.

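
The size of the distance in Remark 18.1 can be explored numerically. The
Python sketch below evaluates $2\Phi(|z|/2) - 1$ for a few values of
$z = \sqrt{n}(\bar{x}_n - \mu)$ and compares it with a brute-force maximum of
$|\Phi(t) - \Phi(t - z)|$ over a grid, where $t = \sqrt{n}(a - \mu)$. The
function names, the chosen values of $z$, and the grid are assumptions made
for this illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # the N(0, 1) distribution


def max_distance(z):
    """Closed form 2 * Phi(|z| / 2) - 1 from Remark 18.1."""
    return 2 * std_normal.cdf(abs(z) / 2) - 1


def max_distance_grid(z, steps=20001):
    """Brute-force check: largest |Phi(t) - Phi(t - z)| over a grid on [-10, 10]."""
    grid = [-10 + 20 * k / (steps - 1) for k in range(steps)]
    return max(abs(std_normal.cdf(t) - std_normal.cdf(t - z)) for t in grid)


for z in (0.0, 0.5, 1.0, 2.0):
    # Columns: z, closed-form distance, grid-based distance.
    print(z, round(max_distance(z), 4), round(max_distance_grid(z), 4))
```

Because $z$ behaves like a realization of a standard normal random variable
whatever the sample size, values of $|z|$ around 1 are typical, and the
corresponding distance is far from zero.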
The question that remains is what to take as an estimate $\hat{F}$ for $F$.
This will depend on how well $F$ can be specified. For the Old Faithful data we
cannot say anything about the type of distribution. However, for the software
data it seems reasonable to model the dataset as a realization of a random
sample from an $Exp(\lambda)$ distribution and then we only have to estimate
the parameter $\lambda$. Different assumptions about $F$ give rise to different
bootstrap procedures. We will discuss two of them in the next sections.
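
The role played by the choice of the estimate for $F$ can be sketched in code.
In the Python example below, the bootstrap principle is written as a function
that takes a sampler for the estimated distribution and a sample statistic, and
it is then applied under the exponential assumption, with $F$ estimated by the
exponential distribution with parameter $1/\bar{x}_n$. The data are simulated
rather than the software data, and the function names, the seed, and the
number of repetitions are assumptions of this sketch.

```python
import random
import statistics


def bootstrap_realizations(draw_sample, statistic, reps, rng):
    """Simulate `reps` realizations of a bootstrapped sample statistic.

    draw_sample(rng) must return one bootstrap dataset drawn from the
    estimate of F; statistic maps a dataset to a number.
    """
    return [statistic(draw_sample(rng)) for _ in range(reps)]


# Illustrative data only (simulated, not the software data from the text).
rng = random.Random(3)
data = [rng.expovariate(1 / 650.0) for _ in range(135)]
n = len(data)
xbar = statistics.mean(data)


def draw_from_fitted_exponential(rng):
    """One bootstrap dataset from Exp(1 / xbar), whose expectation is xbar."""
    return [rng.expovariate(1 / xbar) for _ in range(n)]


# Realizations of the centered bootstrapped sample mean.
centered = bootstrap_realizations(
    draw_from_fitted_exponential,
    lambda sample: statistics.mean(sample) - xbar,
    reps=1000,
    rng=rng,
)
print(round(statistics.mean(centered), 2), round(statistics.pstdev(centered), 2))
```

Replacing `draw_from_fitted_exponential` by a sampler for a different estimate
of $F$ changes the bootstrap procedure while leaving the rest of the
simulation untouched.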


18.2 The empirical bootstrap 


Suppose we consider our dataset $x_1, x_2, \ldots, x_n$ as a realization of a
random sample from a distribution function $F$. When we cannot make any
assumptions about the type of $F$, we can always estimate $F$ by the empirical
distribution function of the dataset:

$$\hat{F}(a) = F_n(a) = \frac{\text{number of } x_i \text{ less than or equal to } a}{n}.$$

Since we estimate $F$ by the empirical distribution function, the corresponding
bootstrap principle is called the empirical bootstrap. Applying this principle
to the centered sample mean, the random sample $X_1, X_2, \ldots, X_n$ from $F$
is replaced by a bootstrap random sample $X_1^*, X_2^*, \ldots, X_n^*$ from
$F_n$, and the distribution of $\bar{X}_n - \mu$ is approximated by that of
$\bar{X}_n^* - \mu^*$, where $\mu^*$ denotes the expectation corresponding to
$F_n$. The question is, of course, how good this approximation is. A
mathematical theorem tells us that the empirical bootstrap works for the
centered sample mean, i.e., the distribution of $\bar{X}_n - \mu$ is well
approximated by that of $\bar{X}_n^* - \mu^*$ (see Remark 18.2). On the other
hand, there are (normalized) sample statistics for which the empirical
bootstrap fails, such as

$$1 - \frac{\text{maximum of } X_1, X_2, \ldots, X_n}{\theta},$$

based on a random sample $X_1, X_2, \ldots, X_n$ from a $U(0, \theta)$
distribution (see Exercise 18.12).



Remark 18.2 (The empirical bootstrap for $\bar{X}_n - \mu$). For the centered
sample mean the bootstrap approximation works, even if we estimate $F$
by the empirical distribution function $F_n$. If $G_n$ denotes the distribution
function of $\bar{X}_n - \mu$ and $G_n^*$ the distribution function of its
bootstrapped version $\bar{X}_n^* - \mu^*$, then the maximum distance between
$G_n^*$ and $G_n$ goes to zero with probability one:

$$\mathrm{P}\left(\lim_{n \to \infty} \sup_{t \in \mathbb{R}} |G_n^*(t) - G_n(t)| = 0\right) = 1$$

(see, for instance, Singh [82]). In fact, the empirical bootstrap approximation
can be improved by approximating the distribution of the standardized
average $\sqrt{n}(\bar{X}_n - \mu)/\sigma$ by its bootstrapped version
$\sqrt{n}(\bar{X}_n^* - \mu^*)/\sigma^*$, where $\sigma$ and $\sigma^*$ denote
the standard deviations of $F$ and $F_n$. This approximation is even better
than the normal approximation by the central limit theorem!
See, for instance, Hall [14].


Let us continue with approximating the distribution of $\bar{X}_n - \mu$ by
that of $\bar{X}_n^* - \mu^*$. First note that the empirical distribution
function $F_n$ of the original dataset is the distribution function of a
discrete random variable that attains the values $x_1, x_2, \ldots, x_n$, each
with probability $1/n$. This means that each of the bootstrap random variables
$X_i^*$ has expectation

$$\mu^* = \mathrm{E}[X_i^*] = x_1 \cdot \frac{1}{n} + x_2 \cdot \frac{1}{n} + \cdots + x_n \cdot \frac{1}{n} = \bar{x}_n.$$

Therefore, applying the empirical bootstrap to $\bar{X}_n - \mu$ means
approximating its distribution by that of $\bar{X}_n^* - \bar{x}_n$. In
principle it would be possible to determine the probability distribution of
$\bar{X}_n^* - \bar{x}_n$. Indeed, the random variable $\bar{X}_n^*$ is based
on the random variables $X_i^*$, whose distribution we know precisely: each
takes the values $x_1, x_2, \ldots, x_n$ with equal probability $1/n$. Hence we
could determine the possible values of $\bar{X}_n^* - \bar{x}_n$ and the
corresponding probabilities. For small $n$ this can be done (see Exercise
18.5), but for large $n$ this becomes cumbersome. Therefore we invoke a second
approximation.
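
For a very small dataset the exact distribution can be found by listing all
$n^n$ equally likely bootstrap datasets, which also shows why this quickly
becomes infeasible. The Python sketch below does this with exact fractions;
the three-element dataset and the function name are hypothetical and serve
only as an illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def exact_centered_bootstrap_mean(data):
    """Exact distribution of the centered bootstrapped sample mean.

    Enumerates all n**n equally likely bootstrap datasets, so this is only
    feasible for very small n.
    """
    n = len(data)
    values = [Fraction(x) for x in data]
    xbar = sum(values) / n
    counts = Counter(sum(draw) / n - xbar for draw in product(values, repeat=n))
    total = n ** n
    return {value: Fraction(count, total) for value, count in sorted(counts.items())}


# Hypothetical dataset of size 3, chosen only to keep the output short.
for value, prob in exact_centered_bootstrap_mean([0, 3, 6]).items():
    print(value, prob)
```

With $n = 3$ there are only 27 bootstrap datasets, but for $n = 272$, the size
of the Old Faithful dataset, the number $n^n$ is far beyond what can be
enumerated.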


Recall the jury example in Section 6.3, where we investigated the variation 
of two different rules that a jury might use to assign grades. In terms of 
the present chapter, the jury example deals with a random sample from a 
$U(-0.5, 0.5)$ distribution and two different sample statistics $T$ and $M$,
corresponding to the two rules. To investigate the distribution of $T$ and $M$,
a simulation was carried out with one thousand runs, where in every run we
generated a realization of a random sample from the $U(-0.5, 0.5)$ distribution
and computed the corresponding realization of T and M. The one thousand 
realizations give a good impression of how T and M vary around the deserved 
score (see Figure 6.4). 

Returning to the distribution of $\bar{X}_n^* - \bar{x}_n$, the analogue would
be to repeatedly generate a realization of the bootstrap random sample from
$F_n$ and every time compute the corresponding realization of
$\bar{X}_n^* - \bar{x}_n$. The resulting realizations would give a good
impression about the distribution of $\bar{X}_n^* - \bar{x}_n$. A realization
of the bootstrap random sample is called a bootstrap dataset and is denoted
by

$$x_1^*, x_2^*, \ldots, x_n^*$$

to distinguish it from the original dataset $x_1, x_2, \ldots, x_n$. For the
centered sample mean the simulation procedure is as follows.




EMPIRICAL BOOTSTRAP SIMULATION (FOR $\bar{X}_n - \mu$). Given a dataset
$x_1, x_2, \ldots, x_n$, determine its empirical distribution function $F_n$ as
an estimate of $F$, and compute the expectation

$$\mu^* = \bar{x}_n = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

corresponding to $F_n$.
1. Generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_n^*$ from $F_n$.
2. Compute the centered sample mean for the bootstrap dataset:

$$\bar{x}_n^* - \bar{x}_n,$$

where

$$\bar{x}_n^* = \frac{x_1^* + x_2^* + \cdots + x_n^*}{n}.$$

Repeat steps 1 and 2 many times.


Note that generating a value $x_i^*$ from $F_n$ is equivalent to choosing one of the
elements $x_1, x_2, \ldots, x_n$ of the original dataset with equal probability $1/n$.


The empirical bootstrap simulation is described for the centered sample mean, 
but clearly a similar simulation procedure can be formulated for any (normal- 
ized) sample statistic. 
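
The two steps of the empirical bootstrap simulation translate almost directly
into code. The following Python sketch is one possible rendering: drawing from
$F_n$ is done by sampling the original data with replacement. The dataset is
hypothetical (the Old Faithful durations are not reproduced here), and the
function name, the seed, and the threshold in the last line are assumptions of
this example.

```python
import random
import statistics


def empirical_bootstrap_centered_means(data, reps, rng):
    """Empirical bootstrap simulation for the centered sample mean.

    Step 1: draw a bootstrap dataset of size n from F_n, i.e. sample the
            original data with replacement.
    Step 2: record (bootstrap sample mean) - (original sample mean).
    """
    n = len(data)
    xbar = statistics.mean(data)  # mu*, the expectation corresponding to F_n
    results = []
    for _ in range(reps):
        # Each original element is chosen with probability 1/n per draw.
        bootstrap_data = rng.choices(data, k=n)
        results.append(statistics.mean(bootstrap_data) - xbar)
    return results


# Hypothetical small dataset, used only to make the sketch runnable.
data = [210, 185, 240, 120, 270, 255, 110, 230, 265, 200]
rng = random.Random(4)
centered = empirical_bootstrap_centered_means(data, reps=1000, rng=rng)

# Example use: fraction of bootstrap realizations that deviate more than 30.
print(sum(abs(c) > 30 for c in centered) / len(centered))
```

A similar function for another sample statistic only needs a different
computation in step 2, for instance the median of the bootstrap dataset minus
the median of the original dataset.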


Remark 18.3 (Some history). Although Efron [7] in 1979 drew attention 
to diverse applications of the empirical bootstrap simulation, it already 
existed before that time, but not as a unified widely applicable technique. 
See Hall [14] for references to earlier ideas along similar lines and to further 
development of the bootstrap. One of Efron’s contributions was to point out 
how to combine the bootstrap with modern computational power. In this 
way, the interest in this procedure is a typical consequence of the influence of 
computers on the development of statistics in the past decades. Efron also 
coined the term “bootstrap,” which is inspired by the American version 
of one of the tall stories of the Baron von Münchhausen, who claimed to
have lifted himself out of a swamp by pulling the strap on his boot (in the 
European version he lifted himself by pulling his hair). 


QUICK EXERCISE 18.2 Describe the empirical bootstrap simulation for the
centered sample median $\mathrm{Med}(X_1, X_2, \ldots, X_n) - F^{\mathrm{inv}}(0.5)$.


For the Old Faithful data we carried out the empirical bootstrap simulation 
for the centered sample mean with one thousand repetitions. In Figure 18.1 
a histogram (left) and kernel density estimate (right) are displayed of one 
thousand centered bootstrap sample means 


    $\bar{x}_{n,1}^* - \bar{x}_n,\ \bar{x}_{n,2}^* - \bar{x}_n,\ \ldots,\ \bar{x}_{n,1000}^* - \bar{x}_n.$




[Figure 18.1: two panels, histogram (left) and kernel density estimate (right); horizontal axes from -18 to 18, vertical axes from 0 to 0.06.]


Fig. 18.1. Histogram and kernel density estimate of centered bootstrap sample 
means. 


Since these are realizations of the random variable $\bar{X}_n^* - \bar{x}_n$, we know from
Section 17.2 that they reflect the distribution of $\bar{X}_n^* - \bar{x}_n$. Hence, as the
distribution of $\bar{X}_n^* - \bar{x}_n$ approximates that of $\bar{X}_n - \mu$, the centered bootstrap
sample means also reflect the distribution of $\bar{X}_n - \mu$. This leads to the following
application.


An application of the empirical bootstrap 


Let us return to our example about the Old Faithful data, which are modeled
as a realization of a random sample from some $F$. Suppose we estimate
the expectation $\mu$ corresponding to $F$ by $\bar{x}_n = 209.3$. Can we say how far
away 209.3 is from the “true” expectation $\mu$? To be honest, the answer is
no...(oops). In a situation like this, the measurements and their corresponding
average are subject to randomness, so that we cannot say anything with
absolute certainty about how far away the average will be from $\mu$. One of the
things we can say is how likely it is that the average is within a given distance
from $\mu$.

To get an impression of how close the average of a dataset of $n = 272$ observed
durations of the Old Faithful geyser is to $\mu$, we want to compute the
probability that the sample mean deviates more than 5 from $\mu$:

    $P(|\bar{X}_n - \mu| > 5).$


Direct computation of this probability is impossible, because we do not know
the distribution of the random variable $\bar{X}_n - \mu$. However, since the distribution
of $\bar{X}_n^* - \bar{x}_n$ approximates the distribution of $\bar{X}_n - \mu$, we can approximate the
probability as follows

    $P(|\bar{X}_n - \mu| > 5) \approx P(|\bar{X}_n^* - \bar{x}_n| > 5) = P(|\bar{X}_n^* - 209.3| > 5),$




where we have also used that for the Old Faithful data, $\bar{x}_n = 209.3$. As we
mentioned before, in principle it is possible to compute the last probability
exactly. Since this is too cumbersome, we approximate $P(|\bar{X}_n^* - 209.3| > 5)$
by means of the one thousand centered bootstrap sample means obtained from
the empirical bootstrap simulation:

    $\bar{x}_{n,1}^* - 209.3,\ \bar{x}_{n,2}^* - 209.3,\ \ldots,\ \bar{x}_{n,1000}^* - 209.3.$

In view of Table 17.2, a natural estimate for $P(|\bar{X}_n^* - 209.3| > 5)$ is the relative
frequency of centered bootstrap sample means that are greater than 5 in
absolute value:

    $\frac{\text{number of } i \text{ with } |\bar{x}_{n,i}^* - 209.3| \text{ greater than } 5}{1000}.$


For the centered bootstrap sample means of Figure 18.1, this relative frequency
is 0.227. Hence, we obtain the following bootstrap approximation

    $P(|\bar{X}_n - \mu| > 5) \approx P(|\bar{X}_n^* - 209.3| > 5) \approx 0.227.$


It should be emphasized that the second approximation can be made ar- 
bitrarily accurate by increasing the number of repetitions in the bootstrap 
procedure. 
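
The relative-frequency approximation can be sketched as follows. The Old Faithful durations are not reproduced here, so the list `fake_durations` is a hypothetical stand-in of the same size ($n = 272$); the function name and the seeds are likewise assumptions made for this example, and the resulting number is not the 0.227 reported in the text.

```python
import random


def bootstrap_tail_probability(data, distance, repetitions=1000, seed=None):
    """Relative frequency of bootstrap sample means further than `distance` from the data mean."""
    rng = random.Random(seed)
    n = len(data)
    mean = sum(data) / n
    exceed = 0
    for _ in range(repetitions):
        boot_mean = sum(rng.choices(data, k=n)) / n  # mean of one bootstrap dataset
        if abs(boot_mean - mean) > distance:
            exceed += 1
    return exceed / repetitions


# Hypothetical stand-in data; NOT the Old Faithful measurements.
data_rng = random.Random(0)
fake_durations = [data_rng.uniform(100, 300) for _ in range(272)]
tail_estimate = bootstrap_tail_probability(fake_durations, distance=5, repetitions=1000, seed=2)
```

Increasing `repetitions` makes the relative frequency a more accurate approximation of the bootstrap probability, which is the second approximation referred to above; it does nothing to improve the first approximation.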


18.3 The parametric bootstrap 


Suppose we consider our dataset as a realization of a random sample from a
distribution of a specific parametric type. In that case the distribution function
is completely determined by a parameter or vector of parameters $\theta$: $F = F_\theta$.
Then we do not have to estimate the whole distribution function $F$, but it
suffices to estimate the parameter (vector) $\theta$ by $\hat{\theta}$ and estimate $F$ by

    $\hat{F} = F_{\hat{\theta}}.$


The corresponding bootstrap principle is called the parametric bootstrap. 


Let us investigate what this would mean for the centered sample mean. First
we should realize that the expectation of $F_\theta$ is also determined by $\theta$: $\mu = \mu_\theta$.
The parametric bootstrap for the centered sample mean now amounts to the
following. The random sample $X_1, X_2, \ldots, X_n$ from the “true” distribution
function $F_\theta$ is replaced by a bootstrap random sample $X_1^*, X_2^*, \ldots, X_n^*$ from
$F_{\hat{\theta}}$, and the probability distribution of $\bar{X}_n - \mu_\theta$ is approximated by that of
$\bar{X}_n^* - \mu^*$, where

    $\mu^* = \mu_{\hat{\theta}}$

denotes the expectation corresponding to $F_{\hat{\theta}}$.
Often the parametric bootstrap approximation is better than the empirical 
bootstrap approximation, as illustrated in the next quick exercise. 




QUICK EXERCISE 18.3 Suppose the dataset $x_1, x_2, \ldots, x_n$ is a realization of a
random sample $X_1, X_2, \ldots, X_n$ from an $N(\mu, 1)$ distribution. Estimate $\mu$ by
$\bar{x}_n$ and consider a bootstrap random sample $X_1^*, X_2^*, \ldots, X_n^*$ from an $N(\bar{x}_n, 1)$
distribution. Check that the probability distributions of $\bar{X}_n - \mu$ and $\bar{X}_n^* - \bar{x}_n$
are the same: an $N(0, 1/n)$ distribution.


Once more, in principle it is possible to determine the distribution of $\bar{X}_n^* - \mu_{\hat{\theta}}$
exactly. However, in contrast with the situation considered in the previous 
quick exercise, in some cases this is still cumbersome. Again a simulation 
procedure may help us out. For the centered sample mean the procedure is as 
follows. 


PARAMETRIC BOOTSTRAP SIMULATION (FOR $\bar{X}_n - \mu$). Given a
dataset $x_1, x_2, \ldots, x_n$, compute an estimate $\hat{\theta}$ for $\theta$. Determine $F_{\hat{\theta}}$
as an estimate for $F_\theta$, and compute the expectation $\mu^* = \mu_{\hat{\theta}}$
corresponding to $F_{\hat{\theta}}$.

1. Generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_n^*$ from $F_{\hat{\theta}}$.

2. Compute the centered sample mean for the bootstrap dataset:

    $\bar{x}_n^* - \mu_{\hat{\theta}},$

where

    $\bar{x}_n^* = \frac{x_1^* + x_2^* + \cdots + x_n^*}{n}.$

Repeat steps 1 and 2 many times.
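
As an illustration, here is a minimal Python sketch of the parametric bootstrap simulation under one specific assumption: the model is taken to be an $Exp(\lambda)$ distribution, with $\lambda$ estimated by $1/\bar{x}_n$, so that $\mu_{\hat{\theta}} = 1/\hat{\lambda} = \bar{x}_n$. The function name, the toy dataset, and the seed are hypothetical; for another parametric family, the estimation line and the sampling line would have to be replaced.

```python
import random


def parametric_bootstrap_centered_means(data, repetitions=1000, seed=None):
    """Centered bootstrap sample means under an assumed exponential model."""
    rng = random.Random(seed)
    n = len(data)
    lam_hat = n / sum(data)  # estimate of lambda: 1 / (sample mean)
    mu_star = 1 / lam_hat    # expectation of the estimated Exp(lam_hat) distribution
    centered = []
    for _ in range(repetitions):
        # Step 1: bootstrap dataset of size n from the estimated model distribution.
        boot = [rng.expovariate(lam_hat) for _ in range(n)]
        # Step 2: centered sample mean of the bootstrap dataset.
        centered.append(sum(boot) / n - mu_star)
    return centered


# Hypothetical toy dataset of positive values, used only to show the call.
toy_times = [120.0, 30.0, 500.0, 75.0, 260.0]
toy_param_centered = parametric_bootstrap_centered_means(toy_times, repetitions=2000, seed=4)
```

The only difference from the empirical version is step 1: values are drawn from the estimated model distribution instead of from the dataset itself.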


As an application we will use the parametric bootstrap simulation to investi- 
gate whether the exponential distribution is a reasonable model for the soft- 
ware data. 


Are the software data exponential? 


Consider fitting an exponential distribution to the software data, as discussed 
in Section 17.3. At first sight, Figure 17.6 shows a reasonable fit with the ex- 
ponential distribution. One way to quantify the difference between the dataset 
and the exponential model is to compute the maximum distance between the 
empirical distribution function $F_n$ of the dataset and the exponential
distribution function $F_{\hat{\lambda}}$ estimated from the dataset:

    $t_{ks} = \sup_{a \in \mathbb{R}} |F_n(a) - F_{\hat{\lambda}}(a)|.$

Here $F_{\hat{\lambda}}(a) = 0$ for $a < 0$ and

    $F_{\hat{\lambda}}(a) = 1 - e^{-\hat{\lambda} a}$ for $a \ge 0$,

where $\hat{\lambda} = 1/\bar{x}_n$ is estimated from the dataset. The quantity $t_{ks}$ is called the
Kolmogorov-Smirnov distance between $F_n$ and $F_{\hat{\lambda}}$.
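
A possible way to compute this distance is sketched below. It relies on the observation that $F_n$ is a step function while the exponential distribution function is continuous and increasing, so the supremum is reached at (or just before) one of the data points. The function names and the toy dataset are assumptions made for this example, and the data are assumed to be positive.

```python
import math


def exp_cdf(a, lam):
    """Distribution function of the Exp(lam) distribution."""
    return 1 - math.exp(-lam * a) if a >= 0 else 0.0


def ks_distance_exponential(data):
    """KS distance between the empirical distribution function of `data`
    and the exponential distribution function with lambda estimated by 1/mean."""
    n = len(data)
    lam_hat = n / sum(data)
    distance = 0.0
    for i, x in enumerate(sorted(data), start=1):
        f = exp_cdf(x, lam_hat)
        # Around the i-th ordered value, F_n lies between (i-1)/n and i/n;
        # compare the model value with both levels (ties are covered by the maximum).
        distance = max(distance, i / n - f, f - (i - 1) / n)
    return distance


# Hypothetical toy dataset, used only to show the call.
toy_ks = ks_distance_exponential([120.0, 30.0, 500.0, 75.0, 260.0])
```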




The idea behind the use of this distance is the following. If $F$ denotes the
“true” distribution function, then according to Section 17.2 the empirical
distribution function $F_n$ will resemble $F$ whether $F$ equals the distribution
function $F_\lambda$ of some $Exp(\lambda)$ distribution or not. On the other hand, if the
“true” distribution function is $F_\lambda$, then the estimated exponential distribution
function $F_{\hat{\lambda}}$ will resemble $F_\lambda$, because $\hat{\lambda} = 1/\bar{x}_n$ is close to the “true” $\lambda$.
Therefore, if $F = F_\lambda$, then both $F_n$ and $F_{\hat{\lambda}}$ will be close to the same distribution
function, so that $t_{ks}$ is small; if $F$ is different from $F_\lambda$, then $F_n$ and $F_{\hat{\lambda}}$
are close to two different distribution functions, so that $t_{ks}$ is large. The value
$t_{ks}$ is always between 0 and 1, and the further away this value is from 0, the
more it is an indication that the exponential model is inappropriate. For the
software dataset we find $\hat{\lambda} = 1/\bar{x}_n = 0.0015$ and $t_{ks} = 0.176$. Does this speak
against the believed exponential model?


One way to investigate this is to find out whether, in the case when the data are
truly a realization of an exponential random sample from $F_\lambda$, the value 0.176 is
unusually large. To answer this question we consider the sample statistic that
corresponds to $t_{ks}$. The estimate $\hat{\lambda} = 1/\bar{x}_n$ is replaced by the random variable
$\hat{\Lambda} = 1/\bar{X}_n$, and the empirical distribution function of the dataset is replaced
by the empirical distribution function of the random sample $X_1, X_2, \ldots, X_n$
(again denoted by $F_n$):

    $F_n(a) = \frac{\text{number of } X_i \text{ less than or equal to } a}{n}.$

In this way, $t_{ks}$ is a realization of the sample statistic

    $T_{ks} = \sup_{a \in \mathbb{R}} |F_n(a) - F_{\hat{\Lambda}}(a)|.$


To find out whether 0.176 is an exceptionally large value for the random variable
$T_{ks}$, we must determine the probability distribution of $T_{ks}$. However, this
is impossible because the parameter $\lambda$ of the $Exp(\lambda)$ distribution is unknown.
We will approximate the distribution of $T_{ks}$ by a parametric bootstrap. We use
the dataset to estimate $\lambda$ by $\hat{\lambda} = 1/\bar{x}_n = 0.0015$ and replace the random sample
$X_1, X_2, \ldots, X_n$ from $F_\lambda$ by a bootstrap random sample $X_1^*, X_2^*, \ldots, X_n^*$
from $F_{\hat{\lambda}}$. Next we approximate the distribution of $T_{ks}$ by that of its
bootstrapped version

    $T_{ks}^* = \sup_{a \in \mathbb{R}} |F_n^*(a) - F_{\hat{\Lambda}^*}(a)|,$

where $F_n^*$ is the empirical distribution function of the bootstrap random sample:

    $F_n^*(a) = \frac{\text{number of } X_i^* \text{ less than or equal to } a}{n},$

and $\hat{\Lambda}^* = 1/\bar{X}_n^*$, with $\bar{X}_n^*$ being the average of the bootstrap random sample.


The bootstrapped sample statistic $T_{ks}^*$ is too complicated to determine its
probability distribution, and hence we perform a parametric bootstrap simulation:




1. We generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_{135}^*$ from an exponential
distribution with parameter $\hat{\lambda} = 0.0015$.

2. We compute the bootstrapped KS distance

    $t_{ks}^* = \sup_{a \in \mathbb{R}} |F_n^*(a) - F_{\hat{\lambda}^*}(a)|,$

where $F_n^*$ denotes the empirical distribution function of the bootstrap
dataset and $F_{\hat{\lambda}^*}$ denotes the estimated exponential distribution function,
where $\hat{\lambda}^* = 1/\bar{x}_n^*$ is computed from the bootstrap dataset.


We repeat steps 1 and 2 one thousand times, which results in one thousand 
values of the bootstrapped KS distance. In Figure 18.2 we have displayed a 
histogram and kernel density estimate of the one thousand bootstrapped KS 
distances. It is clear that if the software data came from an exponential
distribution, the value 0.176 of the KS distance would be very unlikely! This
strongly suggests that the exponential distribution is not the right model for 
the software data. The reason for this is that the Poisson process is the wrong 
model for the series of failures. A closer inspection shows that the rate at 
which failures occur over time is not constant, as was assumed in Chapter 17, 
but decreases. 
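
The simulation just described can be sketched in Python as follows. The values $\hat{\lambda} = 0.0015$ and 0.176 are taken from the text; the sample size 135 is read from step 1 and should be treated as an assumption, as should the function names and the seed. Note that in step 2 the parameter is re-estimated from each bootstrap dataset, which the helper `ks_distance_exponential` does internally.

```python
import math
import random


def ks_distance_exponential(data):
    """KS distance between the empirical distribution function of `data`
    and the Exp distribution function with lambda estimated by 1/mean."""
    n = len(data)
    lam_hat = n / sum(data)
    distance = 0.0
    for i, x in enumerate(sorted(data), start=1):
        f = 1 - math.exp(-lam_hat * x)
        distance = max(distance, i / n - f, f - (i - 1) / n)
    return distance


def bootstrap_ks_distances(n, lam_hat, repetitions=1000, seed=None):
    """Parametric bootstrap simulation of the KS distance under Exp(lam_hat)."""
    rng = random.Random(seed)
    distances = []
    for _ in range(repetitions):
        boot = [rng.expovariate(lam_hat) for _ in range(n)]  # step 1
        distances.append(ks_distance_exponential(boot))      # step 2
    return distances


ks_values = bootstrap_ks_distances(n=135, lam_hat=0.0015, repetitions=1000, seed=3)
# Fraction of bootstrapped KS distances at least as large as the observed 0.176.
fraction_at_least = sum(v >= 0.176 for v in ks_values) / len(ks_values)
```

A small value of `fraction_at_least` expresses numerically what the histogram shows: under the exponential model, a KS distance as large as 0.176 would rarely occur.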


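[Figure 18.2: two panels, histogram (left) and kernel density estimate (right); horizontal axes start at 0 with the observed value 0.176 marked, vertical axes from 0 to 25.]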


Fig. 18.2. One thousand bootstrapped KS distances. 


18.4 Solutions to the quick exercises 


18.1 You could have written something like the following: “Use the dataset
$x_1, x_2, \ldots, x_n$ to compute an estimate $\hat{F}$ for $F$. Replace the random sample
$X_1, X_2, \ldots, X_n$ from $F$ by a random sample $X_1^*, X_2^*, \ldots, X_n^*$ from $\hat{F}$, and
approximate the probability distribution of

    $\mathrm{Med}(X_1, X_2, \ldots, X_n) - F^{\mathrm{inv}}(0.5)$

by that of $\mathrm{Med}(X_1^*, X_2^*, \ldots, X_n^*) - \hat{F}^{\mathrm{inv}}(0.5)$, where $\hat{F}^{\mathrm{inv}}(0.5)$ is the median
of $\hat{F}$.”


18.2 You could have written something like the following: “Given a dataset
$x_1, x_2, \ldots, x_n$, determine its empirical distribution function $F_n$ as an estimate
of $F$, and the median $F_n^{\mathrm{inv}}(0.5)$ of $F_n$.

1. Generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_n^*$ from $F_n$.

2. Compute the sample median for the bootstrap dataset:

    $\mathrm{Med}_n^* - F_n^{\mathrm{inv}}(0.5),$

where $\mathrm{Med}_n^*$ = sample median of $x_1^*, x_2^*, \ldots, x_n^*$.

Repeat steps 1 and 2 many times.”

Note that if $n$ is odd, then $F_n^{\mathrm{inv}}(0.5)$ equals the sample median of the original
dataset, but this is not necessarily so for $n$ even.


18.3 According to Remark 11.2 about the sum of independent normal random
variables, the sum of $n$ independent $N(\mu, 1)$ distributed random variables
has an $N(n\mu, n)$ distribution. Hence by the change-of-units rule for the normal
distribution (see page 106), it follows that $\bar{X}_n$ has an $N(\mu, 1/n)$ distribution,
and that $\bar{X}_n - \mu$ has an $N(0, 1/n)$ distribution. Similarly, the average $\bar{X}_n^*$ of
$n$ independent $N(\bar{x}_n, 1)$ distributed bootstrap random variables has an
$N(\bar{x}_n, 1/n)$ distribution, and therefore $\bar{X}_n^* - \bar{x}_n$ again has an
$N(0, 1/n)$ distribution.


18.5 Exercises 


18.1 ⊡ We generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_6^*$ from the empirical
distribution function of the dataset 


211 £4 6 8, 


i.e., we draw (with replacement) six values from these numbers with equal 
probability 1/6. How many different bootstrap datasets are possible? Are 
they all equally likely to occur? 


18.2 We generate a bootstrap dataset $x_1^*, x_2^*, x_3^*, x_4^*$ from the empirical
distribution function of the dataset


1 3 4 6. 


a. Compute the probability that the bootstrap sample mean is equal to 1. 




b. Compute the probability that the maximum of the bootstrap dataset is 
equal to 6. 

c. Compute the probability that exactly two elements in the bootstrap sample
are less than 2.


18.3 ⊞ We generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_{10}^*$ from the empirical
distribution function of the dataset 


0.39 0.41 0.38 0.44 0.40 
0.36 0.34 0.46 0.35 0.37. 


a. Compute the probability that the bootstrap dataset has exactly three 
elements equal to 0.35. 

b. Compute the probability that the bootstrap dataset has at most two ele- 
ments less than or equal to 0.38. 

c. Compute the probability that the bootstrap dataset has exactly two ele- 
ments less than or equal to 0.38 and all other elements greater than 0.42. 


18.4 ⊡ Consider the dataset from Exercise 18.3, with maximum 0.46.

a. We generate a bootstrap random sample $X_1^*, X_2^*, \ldots, X_{10}^*$ from the empirical
distribution function of the dataset. Compute $P(M_{10}^* < 0.46)$, where
$M_{10}^* = \max\{X_1^*, X_2^*, \ldots, X_{10}^*\}$.

b. The same question as in a, but now for a dataset with distinct elements
$x_1, x_2, \ldots, x_n$ and maximum $m_n$. Compute $P(M_n^* < m_n)$, where $M_n^*$ is
the maximum of a bootstrap random sample $X_1^*, X_2^*, \ldots, X_n^*$ generated
from the empirical distribution function of the dataset.


18.5 ⊡ Suppose we have a dataset

    0 3 6,

which is the realization of a random sample from a distribution function $F$. If
we estimate $F$ by the empirical distribution function, then according to the
bootstrap principle applied to the centered sample mean $\bar{X}_3 - \mu$, we must
replace this random variable by its bootstrapped version $\bar{X}_3^* - \bar{x}_3$. Determine
the possible values for the bootstrap random variable $\bar{X}_3^* - \bar{x}_3$ and the
corresponding probabilities.


18.6 Suppose that the dataset $x_1, x_2, \ldots, x_n$ is a realization of a random
sample from an $Exp(\lambda)$ distribution with distribution function $F_\lambda$, and that
$\bar{x}_n = 59$.

a. Check that the median of the $Exp(\lambda)$ distribution is $m_\lambda = (\ln 2)/\lambda$ (see
also Exercise 5.11).

b. Suppose we estimate $\lambda$ by $1/\bar{x}_n$. Describe the parametric bootstrap
simulation for $\mathrm{Med}(X_1, X_2, \ldots, X_n) - m_\lambda$.




18.7 ⊞ To give an example in which the bootstrapped centered sample mean
in the parametric and empirical bootstrap simulations may be different, consider
the following situation. Suppose that the dataset $x_1, x_2, \ldots, x_n$ is a
realization of a random sample from a $U(0, \theta)$ distribution with expectation
$\mu = \theta/2$. We estimate $\theta$ by

    $\hat{\theta} = \frac{n+1}{n}\, m_n,$

where $m_n = \max\{x_1, x_2, \ldots, x_n\}$. Describe the parametric bootstrap simulation
for the centered sample mean $\bar{X}_n - \mu$.


18.8 Here is an example in which the bootstrapped centered sample mean in
the parametric and empirical bootstrap simulations are the same. Consider the
software data with average $\bar{x}_n = 656.8815$ and median $m_n = 290$, modeled as
a realization of a random sample $X_1, X_2, \ldots, X_n$ from a distribution function
$F$ with expectation $\mu$. By means of bootstrap simulation we would like to get an
impression of the distribution of $\bar{X}_n - \mu$.


a. Suppose that we assume nothing about the distribution of the interfailure 
times. Describe the appropriate bootstrap simulation procedure with one 
thousand repetitions. 


b. Suppose we assume that $F$ is the distribution function of an $Exp(\lambda)$ distribution,
where $\lambda$ is estimated by $1/\bar{x}_n = 0.0015$. Describe the appropriate
bootstrap simulation procedure with one thousand repetitions.


c. Suppose we assume that $F$ is the distribution function of an $Exp(\lambda)$ distribution,
and that (as suggested by Exercise 18.6a) the parameter $\lambda$
is estimated by $(\ln 2)/m_n = 0.0024$. Describe the appropriate bootstrap
simulation procedure with one thousand repetitions.


18.9 ⊡ Consider the dataset from Exercises 15.1 and 17.6 consisting of measured
chest circumferences of Scottish soldiers with average $\bar{x}_n = 39.85$ and
sample standard deviation $s_n = 2.09$. The histogram in Figure 17.11 suggests
modeling the data as the realization of a random sample $X_1, X_2, \ldots, X_n$ from
an $N(\mu, \sigma^2)$ distribution. We estimate $\mu$ by the sample mean and we are interested
in the probability that the sample mean deviates more than 1 from $\mu$:
$P(|\bar{X}_n - \mu| > 1)$. Describe how one can use the bootstrap principle to approximate
this probability, i.e., describe the distribution of the bootstrap random
sample $X_1^*, X_2^*, \ldots, X_n^*$ and compute $P(|\bar{X}_n^* - \mu^*| > 1)$. Note that one does
not need a simulation to approximate this latter probability.


18.10 Consider the software data, with average $\bar{x}_n = 656.8815$, modeled as
a realization of a random sample $X_1, X_2, \ldots, X_n$ from a distribution function
$F$. We estimate the expectation $\mu$ of $F$ by the sample mean and we are
interested in the probability that the sample mean deviates more than ten
from $\mu$: $P(|\bar{X}_n - \mu| > 10)$.




a. Suppose we assume nothing about the distribution of the interfailure 
times. Describe how one can obtain a bootstrap approximation for the 
probability, i.e., describe the appropriate bootstrap simulation procedure 
with one thousand repetitions and how the results of this simulation can 
be used to approximate the probability. 

b. Suppose we assume that $F$ is the distribution function of an $Exp(\lambda)$ distribution.
Describe how one can obtain a bootstrap approximation for the
probability.


18.11 Consider the dataset of measured chest circumferences of 5732 Scottish
soldiers (see Exercises 15.1, 17.6, and 18.9). The Kolmogorov-Smirnov distance
between the empirical distribution function and the distribution function
$F_{\bar{x}_n, s_n}$ of the normal distribution with estimated parameters $\hat{\mu} = \bar{x}_n = 39.85$
and $\hat{\sigma} = s_n = 2.09$ is equal to

    $t_{ks} = \sup_{a \in \mathbb{R}} |F_n(a) - F_{\bar{x}_n, s_n}(a)| = 0.0987,$

where $\bar{x}_n$ and $s_n$ denote sample mean and sample standard deviation of the
dataset. Suppose we want to perform a bootstrap simulation with one thousand
repetitions for the KS distance to investigate to which degree the value
0.0987 agrees with the assumed normality of the dataset. Describe the appropriate
bootstrap simulation that must be carried out.


18.12 To give an example where the empirical bootstrap fails, consider the
following situation. Suppose our dataset $x_1, x_2, \ldots, x_n$ is a realization of a
random sample $X_1, X_2, \ldots, X_n$ from a $U(0, \theta)$ distribution. Consider the
normalized sample statistic

    $T_n = 1 - \frac{M_n}{\theta},$

where $M_n$ is the maximum of $X_1, X_2, \ldots, X_n$. Let $X_1^*, X_2^*, \ldots, X_n^*$ be a bootstrap
random sample from the empirical distribution function of our dataset,
and let $M_n^*$ be the corresponding bootstrap maximum. We are going to compare
the distribution functions of $T_n$ and its bootstrap counterpart

    $T_n^* = 1 - \frac{M_n^*}{m_n},$

where $m_n$ is the maximum of $x_1, x_2, \ldots, x_n$.

a. Check that $P(T_n \le 0) = 0$ and show that

    $P(T_n^* \le 0) = 1 - \left(1 - \frac{1}{n}\right)^n.$

Hint: first argue that $P(T_n^* \le 0) = P(M_n^* = m_n)$, and then use the result
of Exercise 18.4.




b. Let $G_n(t) = P(T_n \le t)$ be the distribution function of $T_n$, and similarly let
$G_n^*(t) = P(T_n^* \le t)$ be the distribution function of the bootstrap statistic
$T_n^*$. Conclude from part a that the maximum distance between $G_n^*$ and
$G_n$ can be bounded from below as follows:

    $\sup_{t \in \mathbb{R}} |G_n^*(t) - G_n(t)| \ge 1 - \left(1 - \frac{1}{n}\right)^n.$

c. Use part b to argue that for all $n$, the maximum distance between $G_n^*$
and $G_n$ is greater than 0.632:

    $\sup_{t \in \mathbb{R}} |G_n^*(t) - G_n(t)| \ge 1 - e^{-1} = 0.632.$

Hint: you may use that $e^{-x} \ge 1 - x$ for all $x$.
We conclude that even for very large sample sizes the maximum distance
between the distribution functions of $T_n$ and its bootstrap counterpart $T_n^*$
is at least 0.632.


18.13 (Exercise 18.12 continued). In contrast to the empirical bootstrap, the
parametric bootstrap for $T_n$ does work. Suppose we estimate the parameter $\theta$
of the $U(0, \theta)$ distribution by

    $\hat{\theta} = \frac{n+1}{n}\, m_n$, where $m_n$ = maximum of $x_1, x_2, \ldots, x_n$.

Let now $X_1^*, X_2^*, \ldots, X_n^*$ be a bootstrap random sample from a $U(0, \hat{\theta})$ distribution,
and let $M_n^*$ be the corresponding bootstrap maximum. Again, we
are going to compare the distribution function $G_n$ of $T_n = 1 - M_n/\theta$ with the
distribution function $G_n^*$ of its bootstrap counterpart $T_n^* = 1 - M_n^*/\hat{\theta}$.

a. Check that the distribution function $F_\theta$ of a $U(0, \theta)$ distribution is given
by

    $F_\theta(a) = \frac{a}{\theta}$ for $0 \le a \le \theta$.

b. Check that the distribution function of $T_n$ is

    $G_n(t) = P(T_n \le t) = 1 - (1 - t)^n$ for $0 \le t \le 1$.

Hint: rewrite $P(T_n \le t)$ as $1 - P(M_n < \theta(1 - t))$ and use the rule on
page 109 about the distribution function of the maximum.

c. Show that $T_n^*$ has the same distribution function:

    $G_n^*(t) = P(T_n^* \le t) = 1 - (1 - t)^n$ for $0 \le t \le 1$.

This means that, in contrast to the empirical bootstrap (see Exercise 18.12),
the parametric bootstrap works perfectly in this situation.


19 


Unbiased estimators 


In Chapter 17 we saw that a dataset can be modeled as a realization of a 
random sample from a probability distribution and that quantities of interest 
correspond to features of the model distribution. One of our tasks is to use the 
dataset to estimate a quantity of interest. We shall mainly deal with the situ- 
ation where it is modeled as one of the parameters of the model distribution 
or as a certain function of the parameters. We will first discuss what we mean 
exactly by an estimator and then introduce the notion of unbiasedness as a 
desirable property for estimators. We end the chapter by providing unbiased 
estimators for the expectation and variance of a model distribution. 


19.1 Estimators 


Consider the arrivals of packages at a network server. One is interested in the 
intensity at which packages arrive on a generic day and in the percentage of 
minutes during which no packages arrive. If the arrivals occur completely at 
random in time, the arrival process can be modeled by a Poisson process. This 
would mean that the number of arrivals during one minute is modeled by a 
random variable having a Poisson distribution with (unknown) parameter $\mu$.
The intensity of the arrivals is then modeled by the parameter $\mu$ itself, and
the percentage of minutes during which no packages arrive is modeled by the
probability of zero arrivals: $e^{-\mu}$. Suppose one observes the arrival process for a
while and gathers a dataset $x_1, x_2, \ldots, x_n$, where $x_i$ represents the number of
arrivals in the $i$th minute. Our task will be to estimate, based on the dataset,
the parameter $\mu$ and a function of the parameter: $e^{-\mu}$.


This example is typical for the general situation in which our dataset is modeled
as a realization of a random sample $X_1, X_2, \ldots, X_n$ from a probability
distribution that is completely determined by one or more parameters. The 
parameters that determine the model distribution are called the model param- 
eters. We focus on the situation where the quantity of interest corresponds 




to a feature of the model distribution that can be described by the model 
parameters themselves or by some function of the model parameters. This 
distribution feature is referred to as the parameter of interest. In discussing
this general setup we shall denote the parameter of interest by the Greek
letter $\theta$. So, for instance, in our network server example, $\mu$ is the model parameter.
When we are interested in the arrival intensity, the role of $\theta$ is played
by the parameter $\mu$ itself, and when we are interested in the percentage of
idle minutes the role of $\theta$ is played by $e^{-\mu}$.


Whatever method we use to estimate the parameter of interest $\theta$, the result
depends only on our dataset. 


ESTIMATE. An estimate is a value $t$ that only depends on the dataset
$x_1, x_2, \ldots, x_n$, i.e., $t$ is some function of the dataset only:

    $t = h(x_1, x_2, \ldots, x_n).$


This description of estimate is a bit formal. The idea is, of course, that the
value $t$, computed from our dataset $x_1, x_2, \ldots, x_n$, gives some indication of
the “true” value of the parameter $\theta$. We have already met several estimates in
Chapter 17; see, for instance, Table 17.2. This table illustrates that the value 
of an estimate can be anything: a single number, a vector of numbers, even a 
complete curve. 


Let us return to our network server example in which our dataset $x_1, x_2, \ldots, x_n$
is modeled as a realization of a random sample from a $Pois(\mu)$ distribution.
The intensity at which packages arrive is then represented by the parameter $\mu$.
Since the parameter $\mu$ is the expectation of the model distribution, the law
of large numbers suggests the sample mean $\bar{x}_n$ as a natural estimate for $\mu$.
On the other hand, the parameter $\mu$ also represents the variance of the model
distribution, so that by a similar reasoning another natural estimate is the
sample variance $s_n^2$.

The percentage of idle minutes is modeled by the probability of zero arrivals. 
Similar to the reasoning in Section 13.4, a natural estimate is the relative 
frequency of zeros in the dataset: 


    $\frac{\text{number of } x_i \text{ equal to zero}}{n}.$

On the other hand, the probability of zero arrivals can be expressed as a
function of the model parameter: $e^{-\mu}$. Hence, if we estimate $\mu$ by $\bar{x}_n$, we
could also estimate $e^{-\mu}$ by $e^{-\bar{x}_n}$.
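
The four estimates mentioned so far can be computed side by side, as in the following sketch. The function name and the list `toy_counts` are hypothetical and serve only as an illustration; the counts are not data from the text. The sample variance is computed with divisor $n - 1$.

```python
import math


def poisson_estimates(counts):
    """Estimates for mu and for the probability of zero arrivals from per-minute counts."""
    n = len(counts)
    mean = sum(counts) / n                                     # estimate of mu
    variance = sum((x - mean) ** 2 for x in counts) / (n - 1)  # another estimate of mu
    rel_freq_zero = sum(1 for x in counts if x == 0) / n       # estimate of exp(-mu)
    return {
        "sample_mean": mean,
        "sample_variance": variance,
        "rel_freq_zero": rel_freq_zero,
        "exp_minus_mean": math.exp(-mean),                     # another estimate of exp(-mu)
    }


# Hypothetical arrival counts for ten minutes, made up for this example.
toy_counts = [0, 2, 1, 3, 0, 1, 2, 4, 1, 2]
estimates = poisson_estimates(toy_counts)
```

For a real dataset the two estimates of $\mu$, and likewise the two estimates of $e^{-\mu}$, will in general differ, which is what motivates the comparison of estimators that follows.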


QUICK EXERCISE 19.1 Suppose we estimate the probability of zero arrivals
$e^{-\mu}$ by the relative frequency of $x_i$ equal to zero. Deduce an estimate for $\mu$
from this.




The preceding examples illustrate that one can often think of several estimates 
for the parameter of interest. This raises questions like 


• When is one estimate better than another?
• Does there exist a best possible estimate?


For instance, can we say which of the values $\bar{x}_n$ or $s_n^2$ computed from the
dataset is closer to the “true” parameter $\mu$? The answer is no. The measurements
and the corresponding estimates are subject to randomness, so that
we cannot say anything with certainty about which of the two is closer to $\mu$.
One of the things we can say for each of them is how likely it is that they are
within a given distance from $\mu$. To this end, we consider the random variables
that correspond to the estimates. Because our dataset $x_1, x_2, \ldots, x_n$ is modeled
as a realization of a random sample $X_1, X_2, \ldots, X_n$, the estimate $t$ is a
realization of a random variable $T$.


ESTIMATOR. Let $t = h(x_1, x_2, \ldots, x_n)$ be an estimate based on the
dataset $x_1, x_2, \ldots, x_n$. Then $t$ is a realization of the random variable

    $T = h(X_1, X_2, \ldots, X_n).$

The random variable $T$ is called an estimator.


The word estimator refers to the method or device for estimation. This is 
distinguished from estimate, which refers to the actual value computed from 
a dataset. Note that estimators are special cases of sample statistics. In the 
remainder of this chapter we will discuss the notion of unbiasedness that 
describes to some extent the behavior of estimators. 


19.2 Investigating the behavior of an estimator 


Let us continue with our network server example. Suppose we have observed
the network for 30 minutes and we have recorded the number of arrivals in
each minute. The dataset is modeled as a realization of a random sample
$X_1, X_2, \ldots, X_n$ of size $n = 30$ from a $Pois(\mu)$ distribution. Let us concentrate
on estimating the probability $p_0$ of zero arrivals, which is an unknown number
between 0 and 1. As motivated in the previous section, we have the following
possible estimators:

    $S = \frac{\text{number of } X_i \text{ equal to zero}}{n}$ and $T = e^{-\bar{X}_n}.$

Our first estimator $S$ can only attain the values $0, \frac{1}{30}, \frac{2}{30}, \ldots, 1$, so that in
general it cannot give the exact value of $p_0$. Similarly for our second estimator
$T$, which can only attain the values $1, e^{-1/30}, e^{-2/30}, \ldots$. So clearly, we




cannot expect our estimators always to give the exact value of $p_0$ on the basis of
30 observations. Well, then what can we expect from a reasonable estimator? 


To get an idea of the behavior of both estimators, we pretend we know $\mu$
and we simulate the estimation process in the case of $n = 30$ observations.
Let us choose $\mu = \ln 10$, so that $p_0 = e^{-\mu} = 0.1$. We draw 30 values from
a Poisson distribution with parameter $\mu = \ln 10$ and compute the value of
estimators $S$ and $T$. We repeat this 500 times, so that we have 500 values
for each estimator. In Figure 19.1 a frequency histogram¹ of these values
for estimator $S$ is displayed on the left and for estimator $T$ on the right.
Clearly, the values of both estimators vary around the value 0.1, which they
are supposed to estimate.
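
This simulation can be reproduced along the following lines. The sketch uses only the Python standard library, so Poisson values are drawn with a simple multiplication method (often attributed to Knuth) rather than with a library routine; that choice, the function names, and the seed are assumptions of this example and not part of the text.

```python
import math
import random


def poisson_draw(mu, rng):
    """Draw one Pois(mu) value by multiplying uniforms until the product drops below exp(-mu)."""
    limit = math.exp(-mu)
    k = 0
    product = rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def simulate_estimators(mu, n=30, repetitions=500, seed=None):
    """Return lists of simulated values of S (relative frequency of zeros) and T (exp of minus the mean)."""
    rng = random.Random(seed)
    s_values, t_values = [], []
    for _ in range(repetitions):
        sample = [poisson_draw(mu, rng) for _ in range(n)]
        s_values.append(sum(1 for x in sample if x == 0) / n)
        t_values.append(math.exp(-sum(sample) / n))
    return s_values, t_values


s_values, t_values = simulate_estimators(mu=math.log(10), n=30, repetitions=500, seed=5)
mean_s = sum(s_values) / len(s_values)
mean_t = sum(t_values) / len(t_values)
```

Frequency histograms of `s_values` and `t_values` should resemble the two panels described in the text, and both averages should be close to 0.1.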


[Figure 19.1: two panels; horizontal axes from 0.0 to 0.3, vertical axes (frequency) from 0 to 250.]


Fig. 19.1. Frequency histograms of 500 values for estimators $S$ (left) and $T$ (right)
of $p_0 = 0.1$.


19.3 The sampling distribution and unbiasedness 


We have just seen that the values generated for estimator $S$ fluctuate around
$p_0 = 0.1$. Although the value of this estimator is not always equal to 0.1, it
is desirable that on average, $S$ is on target, i.e., $E[S] = 0.1$. Moreover, it is
desirable that this property holds no matter what the actual value of $p_0$ is,
i.e.,

    $E[S] = p_0$

irrespective of the value $0 < p_0 < 1$. In order to find out whether this is
true, we need the probability distribution of the estimator $S$. Of course this

¹ In a frequency histogram the height of each vertical bar equals the frequency of
values in the corresponding bin.




is simply the distribution of a random variable, but because estimators are
constructed from a random sample $X_1, X_2, \ldots, X_n$, we speak of the sampling
distribution.

THE SAMPLING DISTRIBUTION. Let $T = h(X_1, X_2, \ldots, X_n)$ be an
estimator based on a random sample $X_1, X_2, \ldots, X_n$. The probability
distribution of $T$ is called the sampling distribution of $T$.


The sampling distribution of $S$ can be found as follows. Write

    $S = \frac{Y}{n},$

where $Y$ is the number of $X_i$ equal to zero. If for each $i$ we label $X_i = 0$ as
a success, then $Y$ is equal to the number of successes in $n$ independent trials
with $p_0$ as the probability of success. Similar to Section 4.3, it follows that $Y$
has a $Bin(n, p_0)$ distribution. Hence the sampling distribution of $S$ is that of
a $Bin(n, p_0)$ distributed random variable divided by $n$. This means that $S$ is
a discrete random variable that attains the values $k/n$, where $k = 0, 1, \ldots, n$,
with probabilities given by

    $p_S\left(\frac{k}{n}\right) = P\left(S = \frac{k}{n}\right) = P(Y = k) = \binom{n}{k} p_0^k (1 - p_0)^{n-k}.$

The probability mass function of $S$ for the case $n = 30$ and $p_0 = 0.1$ is
displayed in Figure 19.2. Since $S = Y/n$ and $Y$ has a $Bin(n, p_0)$ distribution,
it follows that

    $E[S] = \frac{E[Y]}{n} = \frac{n p_0}{n} = p_0.$

So, indeed, the estimator $S$ for $p_0$ has the property $E[S] = p_0$. This property
reflects the fact that estimator $S$ has no systematic tendency to produce


[Figure 19.2: probability mass function $p_S(a)$ plotted against $a$ from 0.0 to 1.0; vertical axis from 0.00 to 0.20.]


Fig. 19.2. Probability mass function of S. 




estimates that are larger than $p_0$, and no systematic tendency to produce
estimates that are smaller than $p_0$. This is a desirable property for estimators,
and estimators that have this property are called unbiased.
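
The sampling distribution of $S$ and its expectation can also be checked numerically. The following sketch tabulates the probability mass function of $S$ for the case $n = 30$ and $p_0 = 0.1$ used in the text and sums $(k/n) \cdot P(S = k/n)$ over all $k$; the function name `pmf_S` is an assumption of this example.

```python
from math import comb


def pmf_S(n, p0):
    """Probability mass function of S = Y/n with Y ~ Bin(n, p0), as a dict value -> probability."""
    return {k / n: comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)}


pmf = pmf_S(30, 0.1)
total_probability = sum(pmf.values())
# Expectation of S: sum of value times probability over all attainable values k/n.
expectation_S = sum(value * prob for value, prob in pmf.items())
```

Up to floating-point rounding, `expectation_S` agrees with $p_0 = 0.1$, in line with $E[S] = p_0$.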


DEFINITION. An estimator $T$ is called an unbiased estimator for the
parameter $\theta$, if

    $E[T] = \theta$

irrespective of the value of $\theta$. The difference $E[T] - \theta$ is called the
bias of $T$; if this difference is nonzero, then $T$ is called biased.


Let us return to our second estimator for the probability of zero arrivals in
the network server example: $T = e^{-\bar{X}_n}$. The sampling distribution can be
obtained as follows. Write

    $T = e^{-Z/n},$

where $Z = X_1 + X_2 + \cdots + X_n$. From Exercise 12.9 we know that the random
variable $Z$, being the sum of $n$ independent $Pois(\mu)$ random variables, has
a $Pois(n\mu)$ distribution. This means that $T$ is a discrete random variable
attaining values $e^{-k/n}$, where $k = 0, 1, \ldots$ and the probability mass function
of $T$ is given by

    $p_T\left(e^{-k/n}\right) = P\left(T = e^{-k/n}\right) = P(Z = k) = \frac{e^{-n\mu}(n\mu)^k}{k!}.$

The probability mass function of $T$ for the case $n = 30$ and $p_0 = 0.1$ is
displayed in Figure 19.3. From the histogram in Figure 19.1 as well as from
the probability mass function in Figure 19.3, you may get the impression
that $T$ is also an unbiased estimator. However, this is not the case, which follows
immediately from an application of Jensen’s inequality:


0.05 
0.04 
0.03 
pr(a) 
0.02 
0.01 


0.00 J "Nncsisssnasassseedsasiesise 0's's\a.e'e'e:e\e. eie10.6 da: 60nei'e 


0.0 0.2 0.4 0.6 0.8 1.0 


a 


Fig. 19.3. Probability mass function of T. 


19.3 The sampling distribution and unbiasedness 291 
E[T|=E je“*] > eX), 
where we have a strict inequality because the function g(a) = e~® is strictly 
convex (g(x) = e~* > 0). Recall that the parameter py equals the expectation 


of the Pois(j) model distribution, so that according to Section 13.1 we have 
E[X,] =. We find that 


E[T] >e * = po, 


which means that the estimator T for po has positive bias. In fact we can 
compute E[T] exactly (see Exercise 19.9): 


E[T]=E[e*] seme), 
Note that n(1 —e~/") > 1, so that 
E[T] =e-"#0-e-"") _, e-# = py 


as n goes to infinity. Hence, although T' has positive bias, the bias decreases 
to zero as the sample size becomes larger. In Figure 19.4 the expectation of 
T is displayed as a function of the sample size n for the case 4 = In(10). For 
n = 30 the difference between E[T] and pp = 0.1 equals 0.0038. 


Fig. 19.4. $\mathrm{E}[T]$ as a function of $n$.
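The exact expression for $\mathrm{E}[T]$ is easy to evaluate. The sketch below is an illustration rather than part of the original text; it computes the bias $\mathrm{E}[T] - p_0$ for $\mu = \ln(10)$ and a few sample sizes. The helper name `expected_T` and the chosen sample sizes are assumptions made for this example.

```python
from math import exp, log

def expected_T(n, mu):
    """E[T] for T = exp(-mean of n Pois(mu) variables), using the closed form above."""
    return exp(-n * mu * (1 - exp(-1 / n)))

mu = log(10)          # so that p0 = exp(-mu) = 0.1
p0 = exp(-mu)

sample_sizes = (1, 5, 10, 30, 100)
biases = [expected_T(n, mu) - p0 for n in sample_sizes]
bias_30 = expected_T(30, mu) - p0

for n, bias in zip(sample_sizes, biases):
    print(n, round(bias, 4))
```

The bias is positive for every sample size and shrinks as $n$ grows, which is the behavior shown in Figure 19.4.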


QUICK EXERCISE 19.2 If we estimate $p_0 = e^{-\mu}$ by the relative frequency of
zeros $S = Y/n$, then we could estimate $\mu$ by $U = -\ln(S)$. Argue that $U$ is a
biased estimator for $\mu$. Is the bias positive or negative?


We conclude this section by returning to the estimation of the parameter $\mu$.
Apart from the (biased) estimator in Quick exercise 19.2 we also considered
the sample mean $\bar{X}_n$ and sample variance $S_n^2$ as possible estimators for $\mu$.
These are both unbiased estimators for the parameter $\mu$. This is a direct
consequence of a more general property of $\bar{X}_n$ and $S_n^2$, which is discussed in
the next section.


19.4 Unbiased estimators for expectation and variance 


Sometimes the quantity of interest can be described by the expectation or
variance of the model distribution, and it is irrelevant whether this distribution
is of a parametric type. In this section we propose unbiased estimators for
these distribution features.


UNBIASED ESTIMATORS FOR EXPECTATION AND VARIANCE. Suppose
$X_1, X_2, \ldots, X_n$ is a random sample from a distribution with
finite expectation $\mu$ and finite variance $\sigma^2$. Then

\[ \bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n} \]

is an unbiased estimator for $\mu$ and

\[ S_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 \]

is an unbiased estimator for $\sigma^2$.


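The first statement says that $\mathrm{E}[\bar{X}_n] = \mu$, which was shown in Section 13.1.
The second statement says $\mathrm{E}[S_n^2] = \sigma^2$. To see this, use linearity of
expectations to write

\[ \mathrm{E}[S_n^2] = \frac{1}{n-1} \sum_{i=1}^{n} \mathrm{E}\left[(X_i - \bar{X}_n)^2\right]. \]

Since $\mathrm{E}[\bar{X}_n] = \mu$, we have $\mathrm{E}[X_i - \bar{X}_n] = \mathrm{E}[X_i] - \mathrm{E}[\bar{X}_n] = 0$. Now note that
for any random variable $Y$ with $\mathrm{E}[Y] = 0$, we have

\[ \mathrm{Var}(Y) = \mathrm{E}[Y^2] - (\mathrm{E}[Y])^2 = \mathrm{E}[Y^2]. \]

Applying this to $Y = X_i - \bar{X}_n$, it follows that

\[ \mathrm{E}\left[(X_i - \bar{X}_n)^2\right] = \mathrm{Var}(X_i - \bar{X}_n). \]

Note that we can write

\[ X_i - \bar{X}_n = \frac{n-1}{n} X_i - \frac{1}{n} \sum_{j \ne i} X_j. \]

Then from the rules concerning variances of sums of independent random
variables we find that

\[
\begin{aligned}
\mathrm{Var}(X_i - \bar{X}_n)
  &= \mathrm{Var}\left(\frac{n-1}{n} X_i - \frac{1}{n} \sum_{j \ne i} X_j\right) \\
  &= \frac{(n-1)^2}{n^2} \mathrm{Var}(X_i) + \frac{1}{n^2} \sum_{j \ne i} \mathrm{Var}(X_j) \\
  &= \left[\frac{(n-1)^2}{n^2} + \frac{n-1}{n^2}\right] \sigma^2 = \frac{n-1}{n} \sigma^2.
\end{aligned}
\]

We conclude that

\[
\begin{aligned}
\mathrm{E}[S_n^2]
  &= \frac{1}{n-1} \sum_{i=1}^{n} \mathrm{E}\left[(X_i - \bar{X}_n)^2\right] \\
  &= \frac{1}{n-1} \sum_{i=1}^{n} \mathrm{Var}(X_i - \bar{X}_n)
   = \frac{1}{n-1} \cdot n \cdot \frac{n-1}{n} \sigma^2 = \sigma^2.
\end{aligned}
\]

This explains why we divide by $n - 1$ in the formula for $S_n^2$; only in this case
$S_n^2$ is an unbiased estimator for the "true" variance $\sigma^2$. If we would divide by
$n$ instead of $n - 1$, we would obtain an estimator with negative bias; it would
systematically produce too-small estimates for $\sigma^2$.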
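The effect of dividing by $n - 1$ rather than $n$ can be verified exactly in a small case. The following sketch is an illustration under stated assumptions, not part of the original text: the model distribution is taken to be a fair die (variance $35/12$), the sample size is $n = 3$, and all $6^3$ equally likely samples are enumerated with exact fractions. The function name `expected_variance_estimators` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def expected_variance_estimators(values, n):
    """Exact expectations of the sum of squared deviations divided by n-1 and by n,
    for a random sample of size n from the uniform distribution on `values`."""
    total_n_minus_1 = Fraction(0)
    total_n = Fraction(0)
    count = 0
    for sample in product(values, repeat=n):      # every sample is equally likely
        mean = Fraction(sum(sample), n)
        squares = sum((x - mean) ** 2 for x in sample)
        total_n_minus_1 += squares / (n - 1)
        total_n += squares / n
        count += 1
    return total_n_minus_1 / count, total_n / count

die = range(1, 7)
sigma2 = Fraction(35, 12)                         # variance of one roll of a fair die
e_s2, e_v2 = expected_variance_estimators(die, 3)
print(e_s2, e_v2, e_v2 - sigma2)
```

In this example the version with $n - 1$ has expectation exactly $\sigma^2$, whereas dividing by $n$ gives $(n-1)\sigma^2/n$, so its bias is $-\sigma^2/n$.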


QUICK EXERCISE 19.3 Consider the following estimator for $\sigma^2$:

\[ V_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2. \]

Compute the bias $\mathrm{E}[V_n^2] - \sigma^2$ for this estimator, where you can keep
computations simple by realizing that $V_n^2 = (n - 1)S_n^2/n$.


Unbiasedness does not always carry over 


We have seen that $S_n^2$ is an unbiased estimator for the "true" variance $\sigma^2$. A
natural question is whether $S_n$ is again an unbiased estimator for $\sigma$. This is not
the case. Since the function $g(x) = x^2$ is strictly convex, Jensen's inequality
yields that

\[ \sigma^2 = \mathrm{E}[S_n^2] > (\mathrm{E}[S_n])^2, \]

which implies that $\mathrm{E}[S_n] < \sigma$. Another example is the network arrivals, in
which $\bar{X}_n$ is an unbiased estimator for $\mu$, whereas $e^{-\bar{X}_n}$ is positively biased
with respect to $e^{-\mu}$. These examples illustrate a general fact: unbiasedness
does not always carry over, i.e., if $T$ is an unbiased estimator for a parameter $\theta$,
then $g(T)$ does not have to be an unbiased estimator for $g(\theta)$.




However, there is one special case in which unbiasedness does carry over,
namely if $g(T) = aT + b$. Indeed, if $T$ is unbiased for $\theta$: $\mathrm{E}[T] = \theta$, then by the
change-of-units rule for expectations,

\[ \mathrm{E}[aT + b] = a\mathrm{E}[T] + b = a\theta + b, \]

which means that $aT + b$ is unbiased for $a\theta + b$.


19.5 Solutions to the quick exercises 


19.1 Write $y$ for the number of $x_i$ equal to zero. Denote the probability of
zero by $p_0$, so that $p_0 = e^{-\mu}$. This means that $\mu = -\ln(p_0)$. Hence if we
estimate $p_0$ by the relative frequency $y/n$, we can estimate $\mu$ by $-\ln(y/n)$.


19.2 The function $g(x) = -\ln(x)$ is strictly convex, since $g''(x) = 1/x^2 > 0$.
Hence by Jensen's inequality

\[ \mathrm{E}[U] = \mathrm{E}[-\ln(S)] > -\ln(\mathrm{E}[S]). \]

Since we have seen that $\mathrm{E}[S] = p_0 = e^{-\mu}$, it follows that
$\mathrm{E}[U] > -\ln(\mathrm{E}[S]) = -\ln(e^{-\mu}) = \mu$. This means that $U$ has positive bias.


19.3 Using that $\mathrm{E}[S_n^2] = \sigma^2$, we find that

\[ \mathrm{E}[V_n^2] = \mathrm{E}\left[\frac{n-1}{n} S_n^2\right] = \frac{n-1}{n} \mathrm{E}[S_n^2] = \frac{n-1}{n} \sigma^2. \]

We conclude that the bias of $V_n^2$ equals $\mathrm{E}[V_n^2] - \sigma^2 = -\sigma^2/n < 0$.


19.6 Exercises 


19.1 Suppose our dataset is a realization of a random sample $X_1, X_2, \ldots, X_n$
from a uniform distribution on the interval $[-\theta, \theta]$, where $\theta$ is unknown.

a. Show that

\[ T = \frac{3}{n} \left(X_1^2 + X_2^2 + \cdots + X_n^2\right) \]

is an unbiased estimator for $\theta^2$.

b. Is $\sqrt{T}$ also an unbiased estimator for $\theta$? If not, argue whether it has
positive or negative bias.


19.2 Suppose the random variables $X_1, X_2, \ldots, X_n$ have the same
expectation $\mu$.

a. Is $S = \frac{1}{2} X_1 + \frac{1}{3} X_2 + \frac{1}{6} X_3$ an unbiased estimator for $\mu$?

b. Under what conditions on constants $a_1, a_2, \ldots, a_n$ is

\[ T = a_1 X_1 + a_2 X_2 + \cdots + a_n X_n \]

an unbiased estimator for $\mu$?


19.3 Suppose the random variables $X_1, X_2, \ldots, X_n$ have the same
expectation $\mu$. For which constants $a$ and $b$ is

\[ T = a(X_1 + X_2 + \cdots + X_n) + b \]

an unbiased estimator for $\mu$?


19.4 Recall Exercise 17.5 about the number of cycles to pregnancy. Suppose
the dataset corresponding to the table in Exercise 17.5a is modeled as a
realization of a random sample $X_1, X_2, \ldots, X_n$ from a $Geo(p)$ distribution,
where $0 < p < 1$ is unknown. Motivated by the law of large numbers, a
natural estimator for $p$ is

\[ T = \frac{1}{\bar{X}_n}. \]

a. Check that $T$ is a biased estimator for $p$ and find out whether it has
positive or negative bias.

b. In Exercise 17.5 we discussed the estimation of the probability that a
woman becomes pregnant within three or fewer cycles. One possible
estimator for this probability is the relative frequency of women that became
pregnant within three cycles

\[ S = \frac{\text{number of } X_i \le 3}{n}. \]

Show that $S$ is an unbiased estimator for this probability.


19.5 Suppose a dataset is modeled as a realization of a random sample
$X_1, X_2, \ldots, X_n$ from an $Exp(\lambda)$ distribution, where $\lambda > 0$ is unknown. Let
$\mu$ denote the corresponding expectation and let $M_n$ denote the minimum of
$X_1, X_2, \ldots, X_n$. Recall from Exercise 8.18 that $M_n$ has an $Exp(n\lambda)$
distribution. Find out for which constant $c$ the estimator

\[ T = cM_n \]

is an unbiased estimator for $\mu$.


19.6 Consider the following dataset of lifetimes of ball bearings in hours.


6278 3113 5236 11584 12628 7725 8604 14266 6125 9350
3212 9003 3523 12888 9460 13431 17809 2812 11825 2398


Source: J.E. Angus. Goodness-of-fit tests for exponentiality based on a
loss-of-memory type functional equation. Journal of Statistical Planning and
Inference, 6:241-251, 1982; example 5 on page 249.


One is interested in estimating the minimum lifetime of this type of ball
bearing. The dataset is modeled as a realization of a random sample $X_1, \ldots, X_n$.
Each random variable $X_i$ is represented as

\[ X_i = \delta + Y_i, \]

where $Y_i$ has an $Exp(\lambda)$ distribution and $\delta > 0$ is an unknown parameter that
is supposed to model the minimum lifetime. The objective is to construct an
unbiased estimator for $\delta$. It is known that

\[ \mathrm{E}[M_n] = \delta + \frac{1}{n\lambda} \quad\text{and}\quad \mathrm{E}[\bar{X}_n] = \delta + \frac{1}{\lambda}, \]

where $M_n$ = minimum of $X_1, X_2, \ldots, X_n$ and $\bar{X}_n = (X_1 + X_2 + \cdots + X_n)/n$.

a. Check that

\[ T = \frac{n}{n-1} \left(\bar{X}_n - M_n\right) \]

is an unbiased estimator for $1/\lambda$.

b. Construct an unbiased estimator for $\delta$.

c. Use the dataset to compute an estimate for the minimum lifetime $\delta$. You
may use that the average lifetime of the data is 8563.5.


19.7 Leaves are divided into four different types: starchy-green, sugary-white,
starchy-white, and sugary-green. According to genetic theory, the types occur
with probabilities $\frac{1}{4}(\theta + 2)$, $\frac{1}{4}\theta$, $\frac{1}{4}(1 - \theta)$, and $\frac{1}{4}(1 - \theta)$, respectively, where
$0 < \theta < 1$. Suppose one has $n$ leaves. Then the number of starchy-green
leaves is modeled by a random variable $N_1$ with a $Bin(n, p_1)$ distribution,
where $p_1 = \frac{1}{4}(\theta + 2)$, and the number of sugary-white leaves is modeled by
a random variable $N_2$ with a $Bin(n, p_2)$ distribution, where $p_2 = \frac{1}{4}\theta$. The
following table lists the counts for the progeny of self-fertilized heterozygotes
among 3839 leaves.


Type            Count

Starchy-green    1997
Sugary-white       32
Starchy-white     906
Sugary-green      904


Source: R.A. Fisher. Statistical methods for research workers. Hafner, New
York, 1958; Table 62 on page 299.


Consider the following two estimators for $\theta$:

\[ T_1 = \frac{4}{n} N_1 - 2 \quad\text{and}\quad T_2 = \frac{4}{n} N_2. \]

a. Check that both $T_1$ and $T_2$ are unbiased estimators for $\theta$.

b. Compute the value of both estimators for $\theta$.


19.8 Recall the black cherry trees example from Exercise 17.9, modeled by
a linear regression model without intercept

\[ Y_i = \beta x_i + U_i \quad\text{for } i = 1, 2, \ldots, n, \]

where $U_1, U_2, \ldots, U_n$ are independent random variables with $\mathrm{E}[U_i] = 0$ and
$\mathrm{Var}(U_i) = \sigma^2$. We discussed three estimators for the parameter $\beta$:

\[ B_1 = \frac{1}{n} \left(\frac{Y_1}{x_1} + \cdots + \frac{Y_n}{x_n}\right), \]

\[ B_2 = \frac{Y_1 + \cdots + Y_n}{x_1 + \cdots + x_n}, \]

\[ B_3 = \frac{x_1 Y_1 + \cdots + x_n Y_n}{x_1^2 + \cdots + x_n^2}. \]

Show that all three estimators are unbiased for $\beta$.


19.9 Consider the network example where the dataset is modeled as a
realization of a random sample $X_1, X_2, \ldots, X_n$ from a $Pois(\mu)$ distribution. We
estimate the probability of zero arrivals $e^{-\mu}$ by means of $T = e^{-\bar{X}_n}$. Check
that

\[ \mathrm{E}[T] = e^{-n\mu(1 - e^{-1/n})}. \]

Hint: write $T = e^{-Z/n}$, where $Z = X_1 + X_2 + \cdots + X_n$ has a $Pois(n\mu)$
distribution.


20 


Efficiency and mean squared error 


In the previous chapter we introduced the notion of unbiasedness as a
desirable property of an estimator. If several unbiased estimators for the same
parameter of interest exist, we need a criterion for comparison of these
estimators. A natural criterion is some measure of spread of the estimators around
the parameter of interest. For unbiased estimators we will use variance. For
arbitrary estimators we introduce the notion of mean squared error (MSE),
which combines variance and bias.


20.1 Estimating the number of German tanks 


In this section we come back to the problem of estimating German war
production as discussed in Section 1.5. We consider serial numbers on tanks, recoded
to numbers running from 1 to some unknown largest number $N$. Given is a
subset of $n$ numbers of this set. The objective is to estimate the total number
of tanks $N$ on the basis of the observed serial numbers.


Denote the observed distinct serial numbers by $x_1, x_2, \ldots, x_n$. This dataset
can be modeled as a realization of random variables $X_1, X_2, \ldots, X_n$
representing $n$ draws without replacement from the numbers $1, 2, \ldots, N$ with equal
probability. Note that in this example our dataset is not a realization of a
random sample, because the random variables $X_1, X_2, \ldots, X_n$ are dependent.
We propose two unbiased estimators. The first one is based on the sample
mean

\[ \bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}, \]

and the second one is based on the sample maximum

\[ M_n = \max\{X_1, X_2, \ldots, X_n\}. \]




An estimator based on the sample mean 


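To construct an unbiased estimator for $N$ based on the sample mean, we start
by computing the expectation of $\bar{X}_n$. The linearity-of-expectations rule also
applies to dependent random variables, so that

\[ \mathrm{E}[\bar{X}_n] = \frac{\mathrm{E}[X_1] + \mathrm{E}[X_2] + \cdots + \mathrm{E}[X_n]}{n}. \]

In Section 9.3 we saw that the marginal distribution of each $X_i$ is the same:

\[ \mathrm{P}(X_i = k) = \frac{1}{N} \quad\text{for } k = 1, 2, \ldots, N. \]

Therefore the expectation of each $X_i$ is given by

\[
\mathrm{E}[X_i] = 1 \cdot \frac{1}{N} + 2 \cdot \frac{1}{N} + \cdots + N \cdot \frac{1}{N}
  = \frac{1 + 2 + \cdots + N}{N} = \frac{\frac{1}{2} N(N + 1)}{N} = \frac{N + 1}{2}.
\]

It follows that

\[ \mathrm{E}[\bar{X}_n] = \frac{N + 1}{2}. \]

This directly implies that

\[ T_1 = 2\bar{X}_n - 1 \]

is an unbiased estimator for $N$, since the change-of-units rule yields that

\[ \mathrm{E}[T_1] = \mathrm{E}[2\bar{X}_n - 1] = 2\mathrm{E}[\bar{X}_n] - 1 = 2 \cdot \frac{N + 1}{2} - 1 = N. \]


QUICK EXERCISE 20.1 Suppose we have observed tanks with (recoded) serial
numbers

61 19 56 24 16.

Compute the value of the estimator $T_1$ for the total number of tanks.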


An estimator based on the sample maximum 


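To construct an unbiased estimator for $N$ based on the maximum, we first
compute the expectation of $M_n$. We start by computing the probability that
$M_n = k$, where $k$ takes the values $n, \ldots, N$. Similar to the combinatorics
used in Section 4.3 to derive the binomial distribution, the number of ways
to draw $n$ numbers without replacement from $1, 2, \ldots, N$ is $\binom{N}{n}$. Hence each
combination has probability $1/\binom{N}{n}$. In order to have $M_n = k$, we must have
one number equal to $k$ and choose the other $n - 1$ numbers out of the numbers
$1, 2, \ldots, k - 1$. There are $\binom{k-1}{n-1}$ ways to do this. Hence for the possible values
$k = n, n + 1, \ldots, N$,

\[
\mathrm{P}(M_n = k) = \frac{\binom{k-1}{n-1}}{\binom{N}{n}}
  = n \cdot \frac{(k-1)!}{(k-n)!} \cdot \frac{(N-n)!}{N!}.
\]

Thus the expectation of $M_n$ is given by

\[
\mathrm{E}[M_n] = \sum_{k=n}^{N} k\, \mathrm{P}(M_n = k)
  = \sum_{k=n}^{N} k \cdot n \cdot \frac{(k-1)!}{(k-n)!} \cdot \frac{(N-n)!}{N!}
  = n \cdot \frac{(N-n)!}{N!} \sum_{k=n}^{N} \frac{k!}{(k-n)!}.
\]

How to continue the computation of $\mathrm{E}[M_n]$? We use a trick: we start by
rearranging

\[ 1 = \sum_{j=n}^{N} \mathrm{P}(M_n = j) = \sum_{j=n}^{N} n \cdot \frac{(j-1)!}{(j-n)!} \cdot \frac{(N-n)!}{N!}, \]

finding that

\[ \sum_{j=n}^{N} \frac{(j-1)!}{(j-n)!} = \frac{N!}{n\,(N-n)!}. \tag{20.1} \]

This holds for any $N$ and any $n \le N$. In particular we could replace $N$ by
$N + 1$ and $n$ by $n + 1$:

\[ \sum_{j=n+1}^{N+1} \frac{(j-1)!}{(j-n-1)!} = \frac{(N+1)!}{(n+1)\,(N-n)!}. \]

Changing the summation variable to $k = j - 1$, we obtain

\[ \sum_{k=n}^{N} \frac{k!}{(k-n)!} = \frac{(N+1)!}{(n+1)\,(N-n)!}. \tag{20.2} \]

This is exactly what we need to finish the computation of $\mathrm{E}[M_n]$. Substituting
(20.2) in what we obtained earlier, we find

\[
\mathrm{E}[M_n] = n \cdot \frac{(N-n)!}{N!} \sum_{k=n}^{N} \frac{k!}{(k-n)!}
  = n \cdot \frac{(N-n)!}{N!} \cdot \frac{(N+1)!}{(n+1)\,(N-n)!}
  = \frac{n(N+1)}{n+1}.
\]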


QUICK EXERCISE 20.2 Choosing $n = N$ in this formula yields $\mathrm{E}[M_N] = N$.
Can you argue that this is the right answer without doing any computations?


With the formula for $\mathrm{E}[M_n]$ we can derive immediately that

\[ T_2 = \frac{n+1}{n} M_n - 1 \]

is an unbiased estimator for $N$, since by the change-of-units rule,

\[
\mathrm{E}[T_2] = \mathrm{E}\left[\frac{n+1}{n} M_n - 1\right] = \frac{n+1}{n} \mathrm{E}[M_n] - 1
  = \frac{n+1}{n} \cdot \frac{n(N+1)}{n+1} - 1 = N.
\]
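The formula for $\mathrm{E}[M_n]$ and the unbiasedness of $T_2$ can be confirmed exactly for specific values. The sketch below is an illustration, not part of the original text; the values $N = 20$ and $n = 5$ and the function names `pmf_max` and `expected_max` are arbitrary choices for this example.

```python
from fractions import Fraction
from math import comb

def pmf_max(N, n):
    """Exact P(M_n = k), k = n, ..., N, for n draws without replacement from 1..N."""
    return {k: Fraction(comb(k - 1, n - 1), comb(N, n)) for k in range(n, N + 1)}

def expected_max(N, n):
    """Exact expectation of the sample maximum M_n."""
    return sum(k * p for k, p in pmf_max(N, n).items())

N, n = 20, 5
total_probability = sum(pmf_max(N, n).values())
e_max = expected_max(N, n)
e_T2 = Fraction(n + 1, n) * e_max - 1     # change-of-units rule applied to T2

print(total_probability, e_max, e_T2)
```

Because the arithmetic uses exact fractions, the comparison with $n(N+1)/(n+1)$ does not suffer from rounding error.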

QUICK EXERCISE 20.3 Compute the value of estimator $T_2$ for the total number
of tanks on basis of the observed numbers from Quick exercise 20.1.


20.2 Variance of an estimator 


In the previous section we saw that we can construct two completely different
estimators for the total number of tanks $N$ that are both unbiased. The obvious
question is: which of the two is better? To answer this question, we investigate
how both estimators vary around the parameter of interest $N$. Although we
could in principle compute the distributions of $T_1$ and $T_2$, we carry out a
small simulation study instead. Take $N = 1000$ and $n = 10$ fixed. We draw
10 numbers, without replacement, from $1, 2, \ldots, 1000$ and compute the value
of the estimators $T_1$ and $T_2$. We repeat this two thousand times, so that we
have 2000 values for both estimators. In Figure 20.1 we have displayed the
histogram of the 2000 values for $T_1$ on the left and the histogram of the 2000
values for $T_2$ on the right. From the histograms, which reflect the probability
mass functions of both estimators, we see that the distributions of $T_1$ and
$T_2$ are of completely different types. As can be expected from the fact that
both estimators are unbiased, the values vary around the parameter of interest
$N = 1000$. The most important difference between the histograms is that the
variation in the values of $T_2$ is less than the variation in the values of $T_1$.
This suggests that estimator $T_2$ estimates the total number of tanks more
efficiently than estimator $T_1$, in the sense that it produces estimates that
are more concentrated around the parameter of interest $N$ than estimates
produced by $T_1$. Recall that the variance measures the spread of a random
variable. Hence the previous discussion motivates the use of the variance of
an estimator to evaluate its performance.

Fig. 20.1. Histograms of two thousand values for $T_1$ (left) and $T_2$ (right).
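A simulation study of this kind takes only a few lines. The sketch below is a possible reconstruction under stated assumptions, not the authors' own code: it uses Python's standard `random` module with an arbitrary seed, and the function name `simulate` is hypothetical. The numbers produced depend on the seed and will not match Figure 20.1 exactly.

```python
import random
from statistics import mean, variance

def simulate(N=1000, n=10, repetitions=2000, seed=1):
    """Draw n serial numbers without replacement from 1..N, `repetitions` times,
    and record the values of T1 = 2 * mean - 1 and T2 = (n + 1) / n * max - 1."""
    rng = random.Random(seed)              # fixed seed (an arbitrary choice)
    t1_values, t2_values = [], []
    for _ in range(repetitions):
        sample = rng.sample(range(1, N + 1), n)
        t1_values.append(2 * mean(sample) - 1)
        t2_values.append((n + 1) / n * max(sample) - 1)
    return t1_values, t2_values

t1_values, t2_values = simulate()
print("T1: mean", mean(t1_values), "variance", variance(t1_values))
print("T2: mean", mean(t2_values), "variance", variance(t2_values))
```

Both averages should lie close to $N = 1000$, while the empirical variance of the $T_2$ values should be clearly smaller than that of the $T_1$ values.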


EFFICIENCY. Let $T_1$ and $T_2$ be two unbiased estimators for the same
parameter $\theta$. Then estimator $T_2$ is called more efficient than
estimator $T_1$ if $\mathrm{Var}(T_2) < \mathrm{Var}(T_1)$, irrespective of the value of $\theta$.


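Let us compare $T_1$ and $T_2$ using this criterion. For $T_1$ we have

\[ \mathrm{Var}(T_1) = \mathrm{Var}(2\bar{X}_n - 1) = 4\mathrm{Var}(\bar{X}_n). \]

Although the $X_i$ are not independent, it is true that all pairs $(X_i, X_j)$ with
$i \ne j$ have the same distribution (this follows in the same way in which
we showed on page 122 that all $X_i$ have the same distribution). With the
variance-of-the-sum rule for $n$ random variables (see Exercise 10.17), we find
that

\[ \mathrm{Var}(X_1 + \cdots + X_n) = n\mathrm{Var}(X_1) + n(n - 1)\mathrm{Cov}(X_1, X_2). \]

In Exercises 9.18 and 10.18, we computed that

\[ \mathrm{Var}(X_1) = \frac{1}{12}(N - 1)(N + 1), \qquad \mathrm{Cov}(X_1, X_2) = -\frac{1}{12}(N + 1). \]

We find therefore that

\[
\begin{aligned}
\mathrm{Var}(T_1) &= 4\mathrm{Var}(\bar{X}_n) = \frac{4}{n^2} \mathrm{Var}(X_1 + \cdots + X_n) \\
  &= \frac{4}{n^2} \left[ n \cdot \frac{1}{12}(N - 1)(N + 1) - n(n - 1) \cdot \frac{1}{12}(N + 1) \right] \\
  &= \frac{1}{3n}(N + 1)\left[N - 1 - (n - 1)\right] \\
  &= \frac{(N + 1)(N - n)}{3n}.
\end{aligned}
\]

Obtaining the variance of $T_2$ is a little more work. One can compute the
variance of $M_n$ in a way that is very similar to the way we obtained $\mathrm{E}[M_n]$.
The result is (see Remark 20.1 for details)

\[ \mathrm{Var}(M_n) = \frac{n(N + 1)(N - n)}{(n + 2)(n + 1)^2}. \]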


Remark 20.1 (How to compute this variance). The trick is to compute
not $\mathrm{E}[M_n^2]$ but $\mathrm{E}[M_n(M_n + 1)]$. First we derive an identity from
Equation (20.1) as before, this time replacing $N$ by $N + 2$ and $n$ by $n + 2$:

\[ \sum_{j=n+2}^{N+2} \frac{(j-1)!}{(j-n-2)!} = \frac{(N+2)!}{(n+2)\,(N-n)!}. \]

Changing the summation variable to $k = j - 2$ yields

\[ \sum_{k=n}^{N} \frac{(k+1)!}{(k-n)!} = \frac{(N+2)!}{(n+2)\,(N-n)!}. \]

With this formula one can obtain:

\[
\mathrm{E}[M_n(M_n + 1)] = \sum_{k=n}^{N} k(k+1) \cdot n \cdot \frac{(k-1)!}{(k-n)!} \cdot \frac{(N-n)!}{N!}
  = \frac{n(N+1)(N+2)}{n+2}.
\]

Since we know $\mathrm{E}[M_n]$, we can determine $\mathrm{E}[M_n^2]$ from this, and subsequently
the variance of $M_n$.


With the expression for the variance of $M_n$, we derive

\[
\mathrm{Var}(T_2) = \mathrm{Var}\left(\frac{n+1}{n} M_n - 1\right) = \frac{(n+1)^2}{n^2} \mathrm{Var}(M_n)
  = \frac{(N+1)(N-n)}{n(n+2)}.
\]

We see that $\mathrm{Var}(T_2) < \mathrm{Var}(T_1)$ for all $N$ and $n \ge 2$. Hence $T_2$ is always more
efficient than $T_1$, except when $n = 1$. In this case the variances are equal,
simply because the estimators are the same: they both equal $2X_1 - 1$.

The quotient $\mathrm{Var}(T_1)/\mathrm{Var}(T_2)$ is called the relative efficiency of $T_2$ with
respect to $T_1$. In our case the relative efficiency of $T_2$ with respect to $T_1$
equals

\[
\frac{\mathrm{Var}(T_1)}{\mathrm{Var}(T_2)} = \frac{(N+1)(N-n)}{3n} \cdot \frac{n(n+2)}{(N+1)(N-n)} = \frac{n+2}{3}.
\]

Surprisingly, this quotient does not depend on $N$, and we see clearly the
advantage of $T_2$ over $T_1$ as the sample size $n$ gets larger.
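For small $N$ both variances can be computed by brute force, which gives an independent check on the formulas. The following sketch is an illustration, not part of the original text: it enumerates all $\binom{N}{n}$ equally likely subsets for the arbitrarily chosen values $N = 12$ and $n = 4$, using exact fractions. The function name `exact_variances` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def exact_variances(N, n):
    """Exact Var(T1) and Var(T2), enumerating every subset of size n of 1..N."""
    t1, t2 = [], []
    for subset in combinations(range(1, N + 1), n):
        t1.append(2 * Fraction(sum(subset), n) - 1)       # T1 = 2 * mean - 1
        t2.append(Fraction(n + 1, n) * max(subset) - 1)   # T2 = (n+1)/n * max - 1

    def var(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    return var(t1), var(t2)

N, n = 12, 4
var_t1, var_t2 = exact_variances(N, n)
relative_efficiency = var_t1 / var_t2
print(var_t1, var_t2, relative_efficiency)
```

For these values the enumeration agrees with $(N+1)(N-n)/(3n)$ and $(N+1)(N-n)/(n(n+2))$, and the quotient equals $(n+2)/3 = 2$.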


QUICK EXERCISE 20.4 Let $n = 5$, and let the sample be

7 3 10 45 15.

Compute the value of the estimator $T_1$ for $N$. Do you notice anything strange?


The self-contradictory behavior of $T_1$ in Quick exercise 20.4 is not rare: this
phenomenon will occur for up to 50% of the samples if $n$ and $N$ are large.
This gives another reason to prefer $T_2$ over $T_1$.


Remark 20.2 (The Cramér-Rao inequality). Suppose we have a random
sample from a continuous distribution with probability density function
$f_\theta$, where $\theta$ is the parameter of interest. Under certain smoothness
conditions on the density $f_\theta$, the variance of an unbiased estimator $T$ for $\theta$ always
has to be larger than or equal to a certain positive number, the so-called
Cramér-Rao lower bound:

\[
\mathrm{Var}(T) \ge \frac{1}{n\, \mathrm{E}\left[\left(\frac{\partial}{\partial\theta} \ln f_\theta(X)\right)^2\right]} \quad\text{for all } \theta.
\]

Here $n$ is the size of the sample and $X$ a random variable whose density
function is $f_\theta$. In some cases we can find unbiased estimators attaining this
bound. These are called minimum variance unbiased estimators. An example
is the sample mean for the expectation of an exponential distribution.
(We will consider this case in Exercise 20.3.)


20.3 Mean squared error 


In the last section we compared two unbiased estimators by considering their
spread around the value to be estimated, where the spread was measured by
the variance. Although unbiasedness is a desirable property, the performance
of an estimator should mainly be judged by the way it spreads around the
parameter $\theta$ to be estimated. This leads to the following definition.


DEFINITION. Let $T$ be an estimator for a parameter $\theta$. The mean
squared error of $T$ is the number $\mathrm{MSE}(T) = \mathrm{E}[(T - \theta)^2]$.


According to this criterion, an estimator $T_1$ performs better than an
estimator $T_2$ if $\mathrm{MSE}(T_1) < \mathrm{MSE}(T_2)$. Note that

\[
\begin{aligned}
\mathrm{MSE}(T) &= \mathrm{E}\left[(T - \theta)^2\right] \\
  &= \mathrm{E}\left[(T - \mathrm{E}[T] + \mathrm{E}[T] - \theta)^2\right] \\
  &= \mathrm{E}\left[(T - \mathrm{E}[T])^2\right] + 2\mathrm{E}\left[T - \mathrm{E}[T]\right](\mathrm{E}[T] - \theta) + (\mathrm{E}[T] - \theta)^2 \\
  &= \mathrm{Var}(T) + (\mathrm{E}[T] - \theta)^2.
\end{aligned}
\]

So the MSE of $T$ turns out to be the variance of $T$ plus the square of the bias
of $T$. In particular, when $T$ is unbiased, the MSE of $T$ is just the variance
of $T$. This means that we already used mean squared errors to compare the
estimators $T_1$ and $T_2$ in the previous section. We extend the notion of efficiency
by saying that estimator $T_2$ is more efficient than estimator $T_1$ (for the same
parameter of interest), if the MSE of $T_2$ is smaller than the MSE of $T_1$.


Unbiasedness and efficiency 


A biased estimator with a small variance may be more useful than an unbiased
estimator with a large variance. We illustrate this with the network server
example from Section 19.2. Recall that our goal was to estimate the probability
$p_0 = e^{-\mu}$ of zero arrivals (of packages) in a minute. We did have two promising
candidates as estimators:

\[ S = \frac{\text{number of } X_i \text{ equal to zero}}{n} \quad\text{and}\quad T = e^{-\bar{X}_n}. \]

In Figure 20.2 we depict histograms of one thousand simulations of the values
of $S$ and $T$ computed for random samples of size $n = 25$ from a $Pois(\mu)$
distribution, where $\mu = 2$. Considering the way the values of the (biased!)
estimator $T$ are more concentrated around the true value $e^{-\mu} = e^{-2} = 0.1353$,
we would be inclined to prefer $T$ over $S$. This choice is strongly supported
by the fact that $T$ is more efficient than $S$: $\mathrm{MSE}(T)$ is always smaller than
$\mathrm{MSE}(S)$, as illustrated in Figure 20.3.

Fig. 20.2. Histograms of a thousand values for $S$ (left) and $T$ (right).


Fig. 20.3. MSEs of $S$ and $T$ as a function of $\mu$.
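The two mean squared errors can be evaluated in closed form. The sketch below is an illustration and rests on assumptions that are not spelled out in the text: $\mathrm{MSE}(S) = p_0(1 - p_0)/n$ because $S$ is unbiased with $nS$ binomial, and $\mathrm{E}[T^2] = e^{-n\mu(1 - e^{-2/n})}$, which follows in the same way as the formula for $\mathrm{E}[T]$ in Exercise 19.9 with $2/n$ in place of $1/n$. The function names `mse_S` and `mse_T` are hypothetical.

```python
from math import exp

def mse_S(mu, n):
    """MSE of the relative frequency of zeros: S is unbiased, so MSE = Var(S)."""
    p0 = exp(-mu)
    return p0 * (1 - p0) / n

def mse_T(mu, n):
    """MSE of T = exp(-sample mean), expanded as E[T^2] - 2 * p0 * E[T] + p0^2."""
    p0 = exp(-mu)
    first_moment = exp(-n * mu * (1 - exp(-1 / n)))    # E[T], see Exercise 19.9
    second_moment = exp(-n * mu * (1 - exp(-2 / n)))   # E[T^2], assumed analogue
    return second_moment - 2 * p0 * first_moment + p0 ** 2

n = 25
for mu in (0.5, 1.0, 2.0, 3.0):
    print(mu, round(mse_S(mu, n), 5), round(mse_T(mu, n), 5))
```

For $n = 25$ and the values of $\mu$ listed, the MSE of $T$ comes out below that of $S$, consistent with Figure 20.3.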




20.4 Solutions to the quick exercises 


20.1 We have $\bar{x}_5 = (61 + 19 + 56 + 24 + 16)/5 = 176/5 = 35.2$. Therefore
$t_1 = 2 \cdot 35.2 - 1 = 69.4$.


20.2 When $n = N$, we have drawn all the numbers. But then the largest
number $M_N$ is $N$, and so $\mathrm{E}[M_N] = N$.


20.3 We have $t_2 = (6/5) \cdot 61 - 1 = 72.2$.


20.4 Since 45 is in the sample, $N$ has to be at least 45. Adding the numbers
yields $7 + 3 + 10 + 15 + 45 = 80$. So $t_1 = 2\bar{x}_n - 1 = 2 \cdot 16 - 1 = 31$. What is
strange about this is that the estimate for $N$ is far smaller than the number
45 in the sample!
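The arithmetic in these solutions can be reproduced with two short functions. This is a minimal sketch rather than part of the original text; the names `estimate_from_mean` and `estimate_from_max` are illustrative.

```python
def estimate_from_mean(serials):
    """Value of T1 = 2 * (sample mean) - 1 for the observed serial numbers."""
    return 2 * sum(serials) / len(serials) - 1

def estimate_from_max(serials):
    """Value of T2 = (n + 1) / n * (sample maximum) - 1 for the observed serial numbers."""
    n = len(serials)
    return (n + 1) / n * max(serials) - 1

sample_a = [61, 19, 56, 24, 16]    # Quick exercises 20.1 and 20.3
sample_b = [7, 3, 10, 45, 15]      # Quick exercise 20.4

t1_a, t2_a = estimate_from_mean(sample_a), estimate_from_max(sample_a)
t1_b, t2_b = estimate_from_mean(sample_b), estimate_from_max(sample_b)
print(t1_a, t2_a, t1_b, t2_b)
```

For the second sample the estimate based on the mean falls below the largest observed serial number, whereas the estimate based on the maximum can never do so.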


20.5 Exercises 


20.1 Given is a random sample $X_1, X_2, \ldots, X_n$ from a distribution with finite
variance $\sigma^2$. We estimate the expectation of the distribution with the sample
mean $\bar{X}_n$. Argue that the larger our sample, the more efficient our estimator.
What is the relative efficiency $\mathrm{Var}(\bar{X}_n)/\mathrm{Var}(\bar{X}_{2n})$ of $\bar{X}_{2n}$ with respect to $\bar{X}_n$?


20.2 Given are two estimators $S$ and $T$ for a parameter $\theta$. Furthermore it
is known that $\mathrm{Var}(S) = 40$ and $\mathrm{Var}(T) = 4$.

a. Suppose that we know that $\mathrm{E}[S] = \theta$ and $\mathrm{E}[T] = \theta + 3$. Which estimator
would you prefer, and why?

b. Suppose that we know that $\mathrm{E}[S] = \theta$ and $\mathrm{E}[T] = \theta + a$ for some positive
number $a$. For each $a$, which estimator would you prefer, and why?


20.3 Suppose we have a random sample $X_1, \ldots, X_n$ from an $Exp(\lambda)$
distribution. Suppose we want to estimate the mean $1/\lambda$. According to Section 19.4
the estimator

\[ T_1 = \bar{X}_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n) \]

is an unbiased estimator of $1/\lambda$. Let $M_n$ be the minimum of $X_1, X_2, \ldots, X_n$.
Recall from Exercise 8.18 that $M_n$ has an $Exp(n\lambda)$ distribution. In
Exercise 19.5 you have determined that

\[ T_2 = nM_n \]

is another unbiased estimator for $1/\lambda$. Which of the estimators $T_1$ and $T_2$
would you choose for estimating the mean $1/\lambda$? Substantiate your answer.


20.4 Consider the situation of this chapter, where we have to estimate the
parameter $N$ from a sample $x_1, \ldots, x_n$ drawn without replacement from the
numbers $\{1, \ldots, N\}$. To keep it simple, we consider $n = 2$. Let $M = M_2$ be
the maximum of $X_1$ and $X_2$. We have found that $T_2 = 3M/2 - 1$ is a good
unbiased estimator for $N$. We want to construct a new unbiased estimator
$T_3$ based on the minimum $L$ of $X_1$ and $X_2$. In the following you may use
that the random variable $L$ has the same distribution as the random variable
$N + 1 - M$ (this follows from symmetry considerations).

a. Show that $T_3 = 3L - 1$ is an unbiased estimator for $N$.

b. Compute $\mathrm{Var}(T_3)$ using that $\mathrm{Var}(M) = (N + 1)(N - 2)/18$. (The latter
has been computed in Remark 20.1.)

c. What is the relative efficiency of $T_2$ with respect to $T_3$?


20.5 Someone is proposing two unbiased estimators $U$ and $V$, with the same
variance $\mathrm{Var}(U) = \mathrm{Var}(V)$. It therefore appears that we would not prefer one
estimator over the other. However, we could go for a third estimator, namely
$W = (U + V)/2$. Note that $W$ is unbiased. To judge the quality of $W$ we
want to compute its variance. Lacking information on the joint probability
distribution of $U$ and $V$, this is impossible. However, we should prefer $W$ in
any case! To see this, show by means of the variance-of-the-sum rule that the
relative efficiency of $U$ with respect to $W$ is equal to

\[ \frac{\mathrm{Var}((U + V)/2)}{\mathrm{Var}(U)} = \frac{1}{2} + \frac{1}{2}\rho(U, V). \]

Here $\rho(U, V)$ is the correlation coefficient. Why does this result imply that we
should use $W$ instead of $U$ (or $V$)?


20.6 A geodesic engineer measures the three unknown angles $\alpha_1$, $\alpha_2$, and $\alpha_3$
of a triangle. He models the uncertainty in the measurements by considering
them as realizations of three independent random variables $T_1$, $T_2$, and $T_3$
with expectations

\[ \mathrm{E}[T_1] = \alpha_1, \quad \mathrm{E}[T_2] = \alpha_2, \quad \mathrm{E}[T_3] = \alpha_3, \]

and all three with the same variance $\sigma^2$. In order to make use of the fact that
the three angles must add to $\pi$, he also considers new estimators $U_1$, $U_2$, and
$U_3$ defined by

\[
\begin{aligned}
U_1 &= T_1 + \tfrac{1}{3}(\pi - T_1 - T_2 - T_3), \\
U_2 &= T_2 + \tfrac{1}{3}(\pi - T_1 - T_2 - T_3), \\
U_3 &= T_3 + \tfrac{1}{3}(\pi - T_1 - T_2 - T_3).
\end{aligned}
\]

(Note that the "deviation" $\pi - T_1 - T_2 - T_3$ is evenly divided over the three
measurements and that $U_1 + U_2 + U_3 = \pi$.)

a. Compute $\mathrm{E}[U_1]$ and $\mathrm{Var}(U_1)$.

b. What does he gain in efficiency when he uses $U_1$ instead of $T_1$ to estimate
the angle $\alpha_1$?

c. What kind of estimator would you choose for $\alpha_1$ if it is known that the
triangle is isosceles (i.e., $\alpha_1 = \alpha_2$)?


20.7 (Exercise 19.7 continued.) Leaves are divided into four different types:
starchy-green, sugary-white, starchy-white, and sugary-green. According to
genetic theory, the types occur with probabilities $\frac{1}{4}(\theta + 2)$, $\frac{1}{4}\theta$, $\frac{1}{4}(1 - \theta)$, and
$\frac{1}{4}(1 - \theta)$, respectively, where $0 < \theta < 1$. Suppose one has $n$ leaves. Then the
number of starchy-green leaves is modeled by a random variable $N_1$ with a
$Bin(n, p_1)$ distribution, where $p_1 = \frac{1}{4}(\theta + 2)$, and the number of sugary-white
leaves is modeled by a random variable $N_2$ with a $Bin(n, p_2)$ distribution,
where $p_2 = \frac{1}{4}\theta$. Consider the following two estimators for $\theta$:

\[ T_1 = \frac{4}{n} N_1 - 2 \quad\text{and}\quad T_2 = \frac{4}{n} N_2. \]

In Exercise 19.7 you showed that both $T_1$ and $T_2$ are unbiased estimators
for $\theta$. Which estimator would you prefer? Motivate your answer.


20.8 Let $\bar{X}_n$ and $\bar{Y}_m$ be the sample means of two independent random
samples of size $n$ (resp. $m$) from the same distribution with mean $\mu$. We
combine these two estimators to a new estimator $T$ by putting

\[ T = r\bar{X}_n + (1 - r)\bar{Y}_m, \]

where $r$ is some number between 0 and 1.

a. Show that $T$ is an unbiased estimator for the mean $\mu$.

b. Show that $T$ is most efficient when $r = n/(n + m)$.


20.9 Given is a random sample $X_1, X_2, \ldots, X_n$ from a $Ber(p)$ distribution.
One considers the estimators

\[ T_1 = \frac{1}{n}(X_1 + \cdots + X_n) \quad\text{and}\quad T_2 = \min\{X_1, \ldots, X_n\}. \]

a. Are $T_1$ and $T_2$ unbiased estimators for $p$?

b. Show that

\[ \mathrm{MSE}(T_1) = \frac{1}{n} p(1 - p), \qquad \mathrm{MSE}(T_2) = p^n - 2p^{n+1} + p^2. \]

c. Which estimator is more efficient when $n = 2$?


20.10 Suppose we have a random sample $X_1, \ldots, X_n$ from an $Exp(\lambda)$
distribution. We want to estimate the expectation $1/\lambda$. According to Section 19.4,

\[ \bar{X}_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n) \]

is an unbiased estimator of $1/\lambda$. Let us consider more generally estimators $T$
of the form

\[ T = c \cdot (X_1 + X_2 + \cdots + X_n), \]

where $c$ is a real number. We are interested in the MSE of these estimators
and would like to know whether there are choices for $c$ that yield a smaller
MSE than the choice $c = 1/n$.

a. Compute $\mathrm{MSE}(T)$ for each $c$.

b. For which $c$ does the estimator perform best in the MSE sense? Compare
this to the unbiased estimator $\bar{X}_n$ that one obtains for $c = 1/n$.


20.11 In Exercise 17.9 we modeled diameters of black cherry trees with the
linear regression model (without intercept)

\[ Y_i = \beta x_i + U_i \]

for $i = 1, 2, \ldots, n$. As usual, the $U_i$ here are independent random variables
with $\mathrm{E}[U_i] = 0$, and $\mathrm{Var}(U_i) = \sigma^2$.
We considered three estimators for the slope $\beta$ of the line $y = \beta x$: the
so-called least squares estimator $T_1$ (which will be considered in Chapter 22),
the average slope estimator $T_2$, and the slope of the averages estimator $T_3$.
These estimators are defined by:

\[
T_1 = \frac{\sum_{i=1}^{n} x_i Y_i}{\sum_{i=1}^{n} x_i^2}, \qquad
T_2 = \frac{1}{n} \sum_{i=1}^{n} \frac{Y_i}{x_i}, \qquad
T_3 = \frac{\sum_{i=1}^{n} Y_i}{\sum_{i=1}^{n} x_i}.
\]

In Exercise 19.8 it was shown that all three estimators are unbiased. Compute
the MSE of all three estimators.


Remark: it can be shown that $T_1$ is always more efficient than $T_3$, which in
turn is more efficient than $T_2$. To prove the first inequality one uses a famous
inequality called the Cauchy-Schwarz inequality; for the second inequality
one uses Jensen's inequality (can you see how?).


20.12 Let $X_1, X_2, \ldots, X_n$ represent $n$ draws without replacement from the
numbers $1, 2, \ldots, N$ with equal probability. The goal of this exercise is to
compute the distribution of $M_n$ in a way other than by the combinatorial
analysis we did in this chapter.

a. Compute $\mathrm{P}(M_n \le k)$, by using, as in Section 8.4, that:

\[ \mathrm{P}(M_n \le k) = \mathrm{P}(X_1 \le k, X_2 \le k, \ldots, X_n \le k). \]

b. Derive that

\[ \mathrm{P}(M_n = n) = \frac{n!\,(N - n)!}{N!}. \]

c. Show that for $k = n + 1, \ldots, N$

\[ \mathrm{P}(M_n = k) = n \cdot \frac{(k - 1)!}{(k - n)!} \cdot \frac{(N - n)!}{N!}. \]


21 


Maximum likelihood 


In previous chapters we could easily construct estimators for various
parameters of interest because these parameters had a natural sample analogue:
expectation versus sample mean, probabilities versus relative frequencies, etc.
However, in some situations such an analogue does not exist. In this chapter,
a general principle to construct estimators is introduced, the so-called
maximum likelihood principle. Maximum likelihood estimators have certain
attractive properties that are discussed in the last section.


21.1 Why a general principle? 


In Section 4.4 we modeled the number of cycles up to pregnancy by a ran- 
dom variable X with a geometric distribution with (unknown) parameter p. 
Weinberg and Gladen studied the effect of smoking on the number of cycles 
and obtained the data in Table 21.1 for 100 smokers and 486 nonsmokers. 


Table 21.1. Observed numbers of cycles up to pregnancy. 


Number of cycles    1    2    3    4    5    6    7    8    9   10   11   12  >12
Smokers            29   16   17    4    3    9    4    5    1    1    1    3    7
Nonsmokers        198  107   55   38   18   22    7    9    5    3    6    6   12


Source: C.R. Weinberg and B.C. Gladen. The beta-geometric distribution
applied to comparative fecundability studies. Biometrics, 42(3):547-560, 1986.


Is the parameter p, which equals the probability of becoming pregnant after 
one cycle, different for smokers and nonsmokers? Let us try to find out by 
estimating p in the two cases. 




What would be reasonable ways to estimate $p$? Since $p = P(X = 1)$, the law
of large numbers (see Section 13.3) motivates use of

$$S = \frac{\text{number of } X_i \text{ equal to } 1}{n}$$

as an estimator for $p$. This yields estimates $p = 29/100 = 0.29$ for smokers and
$p = 198/486 = 0.41$ for nonsmokers. We know from Section 19.4 that $S$ is an
unbiased estimator for $p$. However, one cannot escape the feeling that $S$ is a
“bad” estimator: S does not use all the information in the table, i.e., the way 
the women are distributed over the numbers 2,3,... of observed numbers of 
cycles is not used. One would like to have an estimator that incorporates all 
the available information. Due to the way the data are given, this seems to be 
difficult. For instance, estimators based on the average cannot be evaluated, 
because 7 smokers and 12 nonsmokers had an unknown number of cycles 
up to pregnancy (larger than 12). If one simply ignores the last column in 
Table 21.1 as we did in Exercise 17.5, the average can be computed and yields 
$1/\bar{x}_{93} = 0.2809$ as an estimate of $p$ for smokers and $1/\bar{x}_{474} = 0.3688$ for
nonsmokers. However, because we discard seven values larger than 12 in case 
of the smokers and twelve values larger than 12 in case of the nonsmokers, we 
overestimate p in both cases. 


In the next section we introduce a general principle to find an estimate for a 
parameter of interest, the maximum likelihood principle. This principle yields 
good estimators and will solve problems such as those stated earlier. 


21.2 The maximum likelihood principle 


Suppose a dealer of computer chips is offered on the black market two batches 
of 10000 chips each. According to the seller, in one batch about 50% of the 
chips are defective, while this percentage is about 10% in the other batch. Our 
dealer is only interested in this last batch. Unfortunately the seller cannot tell 
the two batches apart. To help him to make up his mind, the seller offers our 
dealer one batch, from which he is allowed to select and test 10 chips. After 
selecting 10 chips arbitrarily, it turns out that only the second one is defective. 
Our dealer at once decides to buy this batch. Is this a wise decision? 


With the batch where 50% of the chips are defective it is more likely that 
defective chips will appear, whereas with the other batch one would expect 
hardly any defective chip. Clearly, our dealer chooses the batch for which it is 
most likely that only one chip is defective. This is also the guiding idea behind 
the maximum likelihood principle. 


THE MAXIMUM LIKELIHOOD PRINCIPLE. Given a dataset, choose 
the parameter(s) of interest in such a way that the data are most 
likely. 




Set $R_i = 1$ in case the $i$th tested chip was defective and $R_i = 0$ in case it
was operational, where $i = 1, \ldots, 10$. Then $R_1, \ldots, R_{10}$ are ten independent
$Ber(p)$ distributed random variables, where $p$ is the probability that a
randomly selected chip is defective. The probability that the observed data occur
is equal to


$$P(R_1 = 0, R_2 = 1, R_3 = 0, \ldots, R_{10} = 0) = p(1-p)^9.$$


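For the batch where about 10% of the chips are defective we find that

$$P(R_1 = 0, R_2 = 1, R_3 = 0, \ldots, R_{10} = 0) = \frac{1}{10}\left(\frac{9}{10}\right)^9 = 0.039,$$

whereas for the other batch

$$P(R_1 = 0, R_2 = 1, R_3 = 0, \ldots, R_{10} = 0) = \frac{1}{2}\left(\frac{1}{2}\right)^9 = 0.00098.$$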


So the probability for the batch with only 10% defective chips is about 40 
times larger than the probability for the other batch. Given the data, our 
dealer made a sound decision. 
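
The comparison of the two batches is easy to reproduce numerically. The following is a minimal sketch, not part of the original text; the function name `likelihood` and the 0/1 encoding of the test results are our own choices for illustration.

```python
def likelihood(p, results):
    """Probability of an observed 0/1 sequence of independent Ber(p) tests."""
    prob = 1.0
    for r in results:
        prob *= p if r == 1 else 1.0 - p
    return prob

# Observed data: only the second of the ten tested chips is defective.
observed = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

like_10 = likelihood(0.1, observed)  # batch with about 10% defective chips
like_50 = likelihood(0.5, observed)  # batch with about 50% defective chips
ratio = like_10 / like_50

print(f"p = 0.1: {like_10:.3f}")
print(f"p = 0.5: {like_50:.5f}")
print(f"ratio:   {ratio:.1f}")
```

Changing `observed` to other outcomes (for instance the one in the quick exercise that follows) shows how the preferred batch depends on the data.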


QUICK EXERCISE 21.1 Which batch should the dealer choose if only the first 
three chips are defective? 


Returning to the example of the number of cycles up to pregnancy, denoting
$X_i$ as the number of cycles up to pregnancy of the $i$th smoker, recall that

$$P(X_i = k) = (1-p)^{k-1} p$$

and

$$P(X_i > 12) = P(\text{no success in cycle 1 to 12}) = (1-p)^{12};$$

cf. Quick exercise 4.6. From Table 21.1 we see that there are 29 smokers for
which $X_i = 1$, that there are 16 for which $X_i = 2$, etc. Since we model the
data as a random sample from a geometric distribution, the probability of the
data, as a function of $p$, is given by

$$\begin{aligned}
L(p) &= C \cdot P(X_i = 1)^{29} \cdot P(X_i = 2)^{16} \cdots P(X_i = 12)^{3} \cdot P(X_i > 12)^{7} \\
     &= C \cdot p^{29} \cdot \big((1-p)p\big)^{16} \cdots \big((1-p)^{11} p\big)^{3} \cdot \big((1-p)^{12}\big)^{7} \\
     &= C \cdot p^{93} \cdot (1-p)^{322}.
\end{aligned}$$


Here $C$ is the number of ways we can assign 29 ones, 16 twos, ..., 3 twelves,
and 7 numbers larger than 12 to 100 smokers.¹ According to the maximum
likelihood principle we now choose $p$, with $0 < p < 1$, in such a way that $L(p)$
is maximal. Since $C$ does not depend on $p$, we do not need to know the value
of $C$ explicitly to find for which $p$ the function $L(p)$ is maximal.

¹ C = 311657028822819441451842682167854800096 2636 25208359116504431153487280760832000000000.


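Differentiating $L(p)$ with respect to $p$ yields that

$$\begin{aligned}
L'(p) &= C \left[ 93 p^{92} (1-p)^{322} - 322 p^{93} (1-p)^{321} \right] \\
      &= C p^{92} (1-p)^{321} \left[ 93(1-p) - 322p \right] \\
      &= C p^{92} (1-p)^{321} (93 - 415p).
\end{aligned}$$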


Now $L'(p) = 0$ if $p = 0$, $p = 1$, or $p = 93/415 = 0.224$, and $L(p)$ attains its
unique maximum at this last point (check this!). We say that $93/415 = 0.224$ is
the maximum likelihood estimate of $p$ for the smokers. Note that this estimate
is quite a lot smaller than the estimate 0.29 for the smokers we found in the
previous section, and the estimate 0.2809 you obtained in Exercise 17.5.
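
As a check on the bookkeeping, the exponents 93 and 322 and the location of the maximum can be recomputed from the counts in Table 21.1. This is a sketch under our own naming (`counts`, `censored`, `likelihood`); the grid search is only a crude numerical confirmation, not the method used in the text.

```python
# Counts from Table 21.1 for the smokers: cycles 1..12, then the ">12" column.
counts = [29, 16, 17, 4, 3, 9, 4, 5, 1, 1, 1, 3]
censored = 7  # smokers with more than 12 cycles

# Exponents in L(p) = C * p^a * (1 - p)^b.
a = sum(counts)
b = sum((k - 1) * c for k, c in enumerate(counts, start=1)) + 12 * censored

def likelihood(p):
    """Likelihood of the smokers' data, up to the constant factor C."""
    return p ** a * (1.0 - p) ** b

p_closed = a / (a + b)  # root of 93 - 415 p = 0

# Crude grid search over (0, 1) as an independent check.
grid = [i / 10000 for i in range(1, 10000)]
p_grid = max(grid, key=likelihood)

print(a, b)
print(round(p_closed, 4), p_grid)
```

The grid maximum agrees with 93/415 up to the grid spacing, as the derivative calculation predicts.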


QUICK EXERCISE 21.2 Check that for the nonsmokers the probability of the
data is given by

$$L(p) = \text{constant} \cdot p^{474} (1-p)^{955}.$$

Compute the maximum likelihood estimate for $p$.


Remark 21.1 (Some history). The method of maximum likelihood es- 
timation was propounded by Ronald Aylmer Fisher in a highly influential 
paper. In fact, this paper does not contain the original statement of the 
method, which was published by Fisher in 1912 [9], nor does it contain 
the original definition of likelihood, which appeared in 1921 (see [10]). The 
roots of the maximum likelihood method date back as far as 1713, when 
Jacob Bernoulli’s Ars Conjectandi ([1]) was posthumously published. In the 
eighteenth century other important contributions were by Daniel Bernoulli, 
Lambert, and Lagrange (see also [2], [16], and [17]). It is interesting to re- 
mark that another giant of statistics, Karl Pearson, had not understood 
Fisher’s method. Fisher was hurt by Pearson’s lack of understanding, which 
eventually led to a violent confrontation. 


21.3 Likelihood and loglikelihood 


Suppose we have a dataset $x_1, x_2, \ldots, x_n$, modeled as a realization of a random
sample from a distribution characterized by a parameter $\theta$. To stress the
dependence of the distribution on $\theta$, we write

$$p_\theta(x)$$

for the probability mass function in case we have a sample from a discrete
distribution and

$$f_\theta(x)$$

for the probability density function when we have a sample from a continuous
distribution.

For a dataset $x_1, x_2, \ldots, x_n$ modeled as the realization of a random sample
$X_1, \ldots, X_n$ from a discrete distribution, the maximum likelihood principle
now tells us to estimate $\theta$ by that value for which the function $L(\theta)$, given by

$$L(\theta) = P(X_1 = x_1, \ldots, X_n = x_n) = p_\theta(x_1) \cdots p_\theta(x_n),$$

is maximal. This value is called the maximum likelihood estimate of $\theta$. The
function $L(\theta)$ is called the likelihood function. This is a function of $\theta$,
determined by the numbers $x_1, x_2, \ldots, x_n$.

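In case the sample is from a continuous distribution we clearly need to
define the likelihood function $L(\theta)$ in a way different from the discrete case (if
we would define $L(\theta)$ as in the discrete case, one always would have that
$L(\theta) = 0$). For a reasonable definition of the likelihood function we have the
following motivation. Let $f_\theta$ be the probability density function of the $X_i$, and
let $\varepsilon > 0$ be some fixed, small number. It is sensible to choose $\theta$ in such a
way that the probability
$P(x_1 - \varepsilon \le X_1 \le x_1 + \varepsilon, \ldots, x_n - \varepsilon \le X_n \le x_n + \varepsilon)$
is maximal. Since the $X_i$ are independent, we find that

$$\begin{aligned}
&P(x_1 - \varepsilon \le X_1 \le x_1 + \varepsilon, \ldots, x_n - \varepsilon \le X_n \le x_n + \varepsilon) \\
&\quad = P(x_1 - \varepsilon \le X_1 \le x_1 + \varepsilon) \cdots P(x_n - \varepsilon \le X_n \le x_n + \varepsilon) \\
&\quad \approx f_\theta(x_1) f_\theta(x_2) \cdots f_\theta(x_n) (2\varepsilon)^n,
\end{aligned} \qquad (21.1)$$

where in the last step we used that (see also Equation (5.1))

$$P(x_i - \varepsilon \le X_i \le x_i + \varepsilon) = \int_{x_i - \varepsilon}^{x_i + \varepsilon} f_\theta(x)\, dx \approx 2\varepsilon f_\theta(x_i).$$

Note that the right-hand side of (21.1) is maximal whenever the function
$f_\theta(x_1) f_\theta(x_2) \cdots f_\theta(x_n)$ is maximal, irrespective of the value of $\varepsilon$. In view of
this, given a dataset $x_1, x_2, \ldots, x_n$, the likelihood function $L(\theta)$ is defined by

$$L(\theta) = f_\theta(x_1) f_\theta(x_2) \cdots f_\theta(x_n)$$

in the continuous case.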
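
The approximation of the interval probability by $2\varepsilon$ times the density can be checked numerically. The sketch below uses the exponential distribution as a stand-in, because its distribution function is available in closed form; the parameter value, the data point, and the helper names are hypothetical choices of ours.

```python
import math

def exp_cdf(x, lam):
    """Distribution function of the Exp(lam) distribution."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def exp_pdf(x, lam):
    """Probability density function of the Exp(lam) distribution."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

lam, x = 2.0, 0.7  # hypothetical parameter value and data point

ratios = {}
for eps in (0.1, 0.01, 0.001):
    exact = exp_cdf(x + eps, lam) - exp_cdf(x - eps, lam)  # P(x-eps <= X <= x+eps)
    approx = 2 * eps * exp_pdf(x, lam)                     # 2 * eps * f(x)
    ratios[eps] = exact / approx
    print(eps, exact, approx, ratios[eps])
```

The ratio of the exact probability to the approximation approaches 1 as `eps` shrinks, which is all that the motivation of the definition requires.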


MAXIMUM LIKELIHOOD ESTIMATES. The maximum likelihood
estimate of $\theta$ is the value $t = h(x_1, x_2, \ldots, x_n)$ that maximizes the
likelihood function $L(\theta)$. The corresponding random variable

$$T = h(X_1, X_2, \ldots, X_n)$$

is called the maximum likelihood estimator for $\theta$.




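As an example, suppose we have a dataset $x_1, x_2, \ldots, x_n$ modeled as a
realization of a random sample from an $Exp(\lambda)$ distribution, with probability
density function given by $f_\lambda(x) = 0$ if $x < 0$ and

$$f_\lambda(x) = \lambda e^{-\lambda x} \quad \text{for } x \ge 0.$$

Then the likelihood is given by

$$\begin{aligned}
L(\lambda) &= f_\lambda(x_1) f_\lambda(x_2) \cdots f_\lambda(x_n) \\
           &= \lambda e^{-\lambda x_1} \cdot \lambda e^{-\lambda x_2} \cdots \lambda e^{-\lambda x_n} \\
           &= \lambda^n \cdot e^{-\lambda (x_1 + x_2 + \cdots + x_n)}.
\end{aligned}$$

To obtain the maximum likelihood estimate of $\lambda$ it is enough to find the
maximum of $L(\lambda)$. To do so, we determine the derivative of $L(\lambda)$:

$$\begin{aligned}
\frac{d}{d\lambda} L(\lambda)
  &= n \lambda^{n-1} e^{-\lambda \sum_{i=1}^n x_i} - \lambda^n \Big( \sum_{i=1}^n x_i \Big) e^{-\lambda \sum_{i=1}^n x_i} \\
  &= n \lambda^{n-1} e^{-\lambda \sum_{i=1}^n x_i} \Big( 1 - \frac{\lambda}{n} \sum_{i=1}^n x_i \Big).
\end{aligned}$$

We see that $d(L(\lambda))/d\lambda = 0$ if and only if

$$1 - \lambda \bar{x}_n = 0,$$

i.e., if $\lambda = 1/\bar{x}_n$. Check that for this value of $\lambda$ the likelihood function $L(\lambda)$
attains a maximum! So the maximum likelihood estimator for $\lambda$ is $1/\bar{X}_n$.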
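
A small numerical experiment illustrates that the likelihood of an exponential sample peaks at the reciprocal of the sample mean. The sketch below is ours, not the authors': the sample is simulated with a hypothetical rate of 0.5 and a fixed seed, and the comparison values around the estimate are arbitrary.

```python
import math
import random

random.seed(1)
data = [random.expovariate(0.5) for _ in range(20)]  # hypothetical sample

def likelihood(lam, xs):
    """L(lambda) = lambda^n * exp(-lambda * (x_1 + ... + x_n))."""
    return lam ** len(xs) * math.exp(-lam * sum(xs))

mean = sum(data) / len(data)
lam_hat = 1.0 / mean  # closed-form maximum likelihood estimate

# The likelihood at lam_hat should exceed the likelihood at other values.
for lam in (0.5 * lam_hat, 0.9 * lam_hat, lam_hat, 1.1 * lam_hat, 2.0 * lam_hat):
    print(f"{lam:.4f}  {likelihood(lam, data):.3e}")
```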


In the example of the number of cycles up to pregnancy of smoking women,
we have seen that $L(p) = C \cdot p^{93} \cdot (1-p)^{322}$. The maximum likelihood estimate
of $p$ was found by differentiating $L(p)$. Differentiating is not always possible,
as the following example shows.


Estimating the upper endpoint of a uniform distribution 


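Suppose the dataset $x_1 = 0.98$, $x_2 = 1.57$, and $x_3 = 0.31$ is the realization
of a random sample from a $U(0, \theta)$ distribution with $\theta > 0$ unknown. The
probability density function of each $X_i$ is now given by $f_\theta(x) = 0$ if $x$ is not
in $[0, \theta]$ and

$$f_\theta(x) = \frac{1}{\theta} \quad \text{for } 0 \le x \le \theta.$$

The likelihood $L(\theta)$ is zero if $\theta$ is smaller than at least one of the $x_i$, and
equals $1/\theta^3$ if $\theta$ is greater than or equal to each of the three $x_i$, i.e.,

$$L(\theta) = f_\theta(x_1) f_\theta(x_2) f_\theta(x_3) =
\begin{cases}
\dfrac{1}{\theta^3} & \text{if } \theta \ge \max(x_1, x_2, x_3) = 1.57 \\[1ex]
0 & \text{if } \theta < \max(x_1, x_2, x_3) = 1.57.
\end{cases}$$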


Fig. 21.1. Likelihood function $L(\theta)$ of a sample from a $U(0, \theta)$ distribution
(horizontal axis: $\theta$, with the data points 0.31, 0.98, and 1.57 marked).


Figure 21.1 depicts this likelihood function. One glance at this figure is enough
to realize that $L(\theta)$ attains its maximum at $\max(x_1, x_2, x_3) = 1.57$.


In general, given a dataset $x_1, x_2, \ldots, x_n$ originating from a $U(0, \theta)$
distribution, we see that $L(\theta) = 0$ if $\theta$ is smaller than at least one of the $x_i$ and that
$L(\theta) = 1/\theta^n$ if $\theta$ is greater than or equal to the largest of the $x_i$. We conclude
that the maximum likelihood estimator of $\theta$ is given by $\max\{X_1, X_2, \ldots, X_n\}$.
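
The jump in this likelihood function is easy to see by evaluating it at a few points around the largest observation. The following sketch uses the three data points from the example; the function name and the evaluation points are our own illustrative choices, and the data are assumed to be nonnegative.

```python
def likelihood(theta, xs):
    """Likelihood of a U(0, theta) sample: theta^(-n) if theta >= max(xs), else 0."""
    if theta <= 0 or theta < max(xs):
        return 0.0
    return theta ** (-len(xs))

data = [0.98, 1.57, 0.31]
theta_hat = max(data)  # maximum likelihood estimate

for theta in (1.0, 1.56, 1.57, 1.58, 2.0):
    print(theta, likelihood(theta, data))
```

Because the maximum sits at a discontinuity, setting a derivative equal to zero would not find it; inspecting the function directly does.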


Loglikelihood 


In the preceding example it was easy to find the value of the parameter for
which the likelihood is maximal. Usually one can find the maximum by
differentiating the likelihood function $L(\theta)$. The calculation of the derivative of
$L(\theta)$ may be tedious, because $L(\theta)$ is a product of terms, all involving $\theta$ (see
also Quick exercise 21.3). To differentiate $L(\theta)$ we have to apply the product
rule from calculus. Considering the logarithm of $L(\theta)$ changes the product of
the terms involving $\theta$ into a sum of logarithms of these terms, which makes
the process of differentiating easier. Moreover, because the logarithm is an
increasing function, the likelihood function $L(\theta)$ and the loglikelihood function
$\ell(\theta)$, defined by

$$\ell(\theta) = \ln(L(\theta)),$$

attain their extreme values for the same values of $\theta$. In particular, $L(\theta)$ is
maximal if and only if $\ell(\theta)$ is maximal. This is illustrated in Figure 21.2 by
the likelihood function $L(p) = C p^{93} (1-p)^{322}$ and the loglikelihood function
$\ell(p) = \ln(C) + 93 \ln(p) + 322 \ln(1-p)$ for the smokers.


In the situation that we have a dataset $x_1, x_2, \ldots, x_n$ modeled as a
realization of a random sample from an $Exp(\lambda)$ distribution, we found as likelihood
function $L(\lambda) = \lambda^n \cdot e^{-\lambda (x_1 + x_2 + \cdots + x_n)}$. Therefore, the loglikelihood function
is given by

$$\ell(\lambda) = n \ln(\lambda) - \lambda (x_1 + x_2 + \cdots + x_n).$$


Fig. 21.2. The graphs of the likelihood function $L(p)$ and the loglikelihood function
$\ell(p)$ for the smokers (horizontal axes: $p$, with ticks at 0, 93/415, and 0.5).


QUICK EXERCISE 21.3 In this example, use the loglikelihood function $\ell(\lambda)$ to
show that the maximum likelihood estimate of $\lambda$ equals $1/\bar{x}_n$.


Estimating the parameters of the normal distribution 


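Suppose that the dataset $x_1, x_2, \ldots, x_n$ is a realization of a random sample
from an $N(\mu, \sigma^2)$ distribution, with $\mu$ and $\sigma$ unknown. What are the maximum
likelihood estimates for $\mu$ and $\sigma$?

In this case $\theta$ is the vector $(\mu, \sigma)$, and therefore the likelihood function is a
function of two variables:

$$L(\mu, \sigma) = f_{\mu,\sigma}(x_1) f_{\mu,\sigma}(x_2) \cdots f_{\mu,\sigma}(x_n),$$

where each $f_{\mu,\sigma}(x)$ is the $N(\mu, \sigma^2)$ probability density function:

$$f_{\mu,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \qquad -\infty < x < \infty.$$

Since

$$\ln\big(f_{\mu,\sigma}(x)\big) = -\ln(\sigma) - \ln(\sqrt{2\pi}) - \frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2,$$

one finds that

$$\begin{aligned}
\ell(\mu, \sigma) &= \ln\big(f_{\mu,\sigma}(x_1)\big) + \cdots + \ln\big(f_{\mu,\sigma}(x_n)\big) \\
  &= -n \ln(\sigma) - n \ln(\sqrt{2\pi}) - \frac{1}{2\sigma^2} \big( (x_1 - \mu)^2 + \cdots + (x_n - \mu)^2 \big).
\end{aligned}$$

The partial derivatives of $\ell$ are

$$\begin{aligned}
\frac{\partial \ell}{\partial \mu} &= \frac{1}{\sigma^2} \big( (x_1 - \mu) + \cdots + (x_n - \mu) \big) = \frac{n}{\sigma^2} (\bar{x}_n - \mu), \\
\frac{\partial \ell}{\partial \sigma} &= -\frac{n}{\sigma} + \frac{1}{\sigma^3} \big( (x_1 - \mu)^2 + \cdots + (x_n - \mu)^2 \big).
\end{aligned}$$

Setting both partial derivatives equal to zero yields

$$\mu = \bar{x}_n \qquad \text{and} \qquad \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x}_n)^2}.$$

It is not hard to show that for these values of $\mu$ and $\sigma$ the likelihood
function $L(\mu, \sigma)$ attains a maximum. We find that $\bar{x}_n$ is the maximum likelihood
estimate for $\mu$ and that

$$\sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x}_n)^2}$$

is the maximum likelihood estimate for $\sigma$.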
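
For a concrete dataset the two estimates are a one-line computation each. The sketch below uses hypothetical measurements of our own; it also compares the loglikelihood at the estimates with its value at a few nearby parameter pairs, and notes that Python's `statistics.pstdev` uses the same $1/n$ formula.

```python
import math
import statistics

data = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7]  # hypothetical measurements

n = len(data)
mu_hat = sum(data) / n
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)

def loglik(mu, sigma, xs):
    """Loglikelihood of an N(mu, sigma^2) sample."""
    return (-len(xs) * math.log(sigma)
            - len(xs) * math.log(math.sqrt(2 * math.pi))
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))

best = loglik(mu_hat, sigma_hat, data)
nearby = [loglik(mu_hat + dm, sigma_hat + ds, data)
          for dm in (-0.1, 0.0, 0.1) for ds in (-0.1, 0.0, 0.1)]

print(mu_hat, sigma_hat)
print(statistics.pstdev(data))  # population standard deviation, also with 1/n
print(best >= max(nearby))
```

Note that `statistics.stdev`, which divides by $n - 1$, would give a different (larger) value than the maximum likelihood estimate.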


21.4 Properties of maximum likelihood estimators 


Apart from the fact that the maximum likelihood principle provides a general 
principle to construct estimators, one can also show that maximum likelihood 
estimators have several desirable properties. 


Invariance principle 


In the previous example, we saw that

$$D_n = \sqrt{\frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2}$$

is the maximum likelihood estimator for the parameter $\sigma$ of an $N(\mu, \sigma^2)$
distribution. Does this imply that $D_n^2$ is the maximum likelihood estimator for $\sigma^2$?
This is indeed the case! In general one can show that if $T$ is the maximum
likelihood estimator of a parameter $\theta$ and $g(\theta)$ is an invertible function of $\theta$,
then $g(T)$ is the maximum likelihood estimator for $g(\theta)$.




Asymptotic unbiasedness 


The maximum likelihood estimator $T$ may be biased. For example, because
$D_n^2 = \frac{n-1}{n} S_n^2$ for the previously mentioned maximum likelihood estimator $D_n^2$
of the parameter $\sigma^2$ of an $N(\mu, \sigma^2)$ distribution, it follows from Section 19.4
that

$$E\left[D_n^2\right] = E\left[\frac{n-1}{n} S_n^2\right] = \frac{n-1}{n} E\left[S_n^2\right] = \frac{n-1}{n} \sigma^2.$$

We see that $D_n^2$ is a biased estimator for $\sigma^2$, but also that as $n$ goes to
infinity, the expected value of $D_n^2$ converges to $\sigma^2$. This holds more generally.
Under mild conditions on the distribution of the random variables $X_i$ under
consideration (see, e.g., [36]), one can show that asymptotically (that is, as
the size $n$ of the dataset goes to infinity) maximum likelihood estimators are
unbiased. By this we mean that if $T_n = h(X_1, X_2, \ldots, X_n)$ is the maximum
likelihood estimator for a parameter $\theta$, then

$$\lim_{n \to \infty} E[T_n] = \theta.$$
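
The bias factor $(n-1)/n$ can be made visible with a small simulation. The following is a sketch under assumptions of our own: standard normal data (so that $\sigma^2 = 1$), sample size 5, 20000 repetitions, and a fixed seed for reproducibility.

```python
import random

random.seed(2)
n, reps = 5, 20000  # hypothetical sample size and number of repetitions

total = 0.0
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]  # N(0, 1) sample, sigma^2 = 1
    mean = sum(xs) / n
    total += sum((x - mean) ** 2 for x in xs) / n    # maximum likelihood estimate of sigma^2

avg_d2 = total / reps
print(avg_d2, (n - 1) / n)
```

The simulated average lies close to $(n-1)/n = 0.8$ rather than to 1; repeating the experiment with a larger `n` moves both numbers toward 1.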


Asymptotic minimum variance 


The variance of an unbiased estimator for a parameter $\theta$ is always larger than
or equal to a certain positive number, known as the Cramér-Rao lower bound
(see Remark 20.2). Again under mild conditions one can show that
maximum likelihood estimators have asymptotically the smallest variance among
unbiased estimators. That is, asymptotically the variance of the maximum
likelihood estimator for a parameter $\theta$ attains the Cramér-Rao lower bound.


21.5 Solutions to the quick exercises 


21.1 In the case that only the first three chips are defective, the probability
that the observed data occur is equal to

$$P(R_1 = 1, R_2 = 1, R_3 = 1, R_4 = 0, \ldots, R_{10} = 0) = p^3 (1-p)^7.$$

For the batch where about 10% of the chips are defective we find that

$$P(R_1 = 1, R_2 = 1, R_3 = 1, R_4 = 0, \ldots, R_{10} = 0) = \left(\frac{1}{10}\right)^3 \left(\frac{9}{10}\right)^7 = 0.00048,$$

whereas for the other batch this probability is equal to $(\frac{1}{2})^3 (\frac{1}{2})^7 = 0.00098$.
So the probability for the batch with about 50% defective chips is about 2
times larger than the probability for the other batch. In view of this, it would
be reasonable to choose the other batch, not the tested one.


21.2 From Table 21.1 we derive

$$\begin{aligned}
L(p) &= \text{constant} \cdot P(X_i = 1)^{198} \cdot P(X_i = 2)^{107} \cdots P(X_i = 12)^{6} \cdot P(X_i > 12)^{12} \\
     &= \text{constant} \cdot p^{198} \cdot \left[(1-p)p\right]^{107} \cdots \left[(1-p)^{11} p\right]^{6} \cdot \left[(1-p)^{12}\right]^{12} \\
     &= \text{constant} \cdot p^{474} \cdot (1-p)^{955}.
\end{aligned}$$

Here the constant is the number of ways we can assign 198 ones, 107 twos, ...,
6 twelves, and 12 numbers larger than 12 to 486 nonsmokers. Differentiating
$L(p)$ with respect to $p$ yields that

$$\begin{aligned}
L'(p) &= \text{constant} \cdot \left[ 474 p^{473} (1-p)^{955} - 955 p^{474} (1-p)^{954} \right] \\
      &= \text{constant} \cdot p^{473} (1-p)^{954} \left[ 474(1-p) - 955p \right] \\
      &= \text{constant} \cdot p^{473} (1-p)^{954} (474 - 1429p).
\end{aligned}$$

Now $L'(p) = 0$ if $p = 0$, $p = 1$, or $p = 474/1429 = 0.33$, and $L(p)$ attains its
unique maximum at this last point.


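21.3 The loglikelihood function $\ell(\lambda)$ has derivative

$$\ell'(\lambda) = \frac{n}{\lambda} - (x_1 + x_2 + \cdots + x_n) = n \left( \frac{1}{\lambda} - \bar{x}_n \right).$$

One finds that $\ell'(\lambda) = 0$ if and only if $\lambda = 1/\bar{x}_n$ and that this is a maximum.
The maximum likelihood estimate for $\lambda$ is therefore $1/\bar{x}_n$.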


21.6 Exercises 


21.1 Consider the following situation. Suppose we have two fair dice, $D_1$
with 5 red sides and 1 white side and $D_2$ with 1 red side and 5 white sides.
We pick one of the dice randomly, and throw it repeatedly until red comes
up for the first time. With the same die this experiment is repeated two more
times. Suppose the following happens:


First experiment: first red appears in 3rd throw
Second experiment: first red appears in 5th throw
Third experiment: first red appears in 4th throw.


Show that for die $D_1$ this happens with probability $5.7424 \cdot 10^{-8}$, and for
die $D_2$ the probability with which this happens is $8.9725 \cdot 10^{-4}$. Given these
probabilities, which die do you think we picked?


21.2 We throw an unfair coin repeatedly until heads comes up for the first
time. We repeat this experiment three times (with the same coin) and obtain 
the following data: 




First experiment: heads first comes up in 3rd throw 
Second experiment: heads first comes up in 5th throw 
Third experiment: heads first comes up in 4th throw. 


Let $p$ be the probability that heads comes up in a throw with this coin.
Determine the maximum likelihood estimate $\hat{p}$ of $p$.


21.3 In Exercise 17.4 we modeled the hits of London by flying bombs by a
Poisson distribution with parameter $\mu$.


a. Use the data from Exercise 17.4 to find the maximum likelihood estimate
of $\mu$.

b. Suppose the summarized data from Exercise 17.4 got corrupted in the
following way:


Number of hits      0 or 1    2    3    4    5    6    7
Number of squares      440   93   35    7    0    0    1


Using this new data, what is the maximum likelihood estimate of $\mu$?


21.4 In Section 19.1, we considered the arrivals of packages at a network
server, where we modeled the number of arrivals per minute by a $Pois(\mu)$
distribution. Let $x_1, x_2, \ldots, x_n$ be a realization of a random sample from a
$Pois(\mu)$ distribution. We saw on page 286 that a natural estimate of the
probability of zeros in the dataset is given by

$$\frac{\text{number of } x_i \text{ equal to zero}}{n}.$$

a. Show that the likelihood $L(\mu)$ is given by

$$L(\mu) = \frac{e^{-n\mu}}{x_1! \cdots x_n!}\, \mu^{x_1 + x_2 + \cdots + x_n}.$$

b. Determine the loglikelihood $\ell(\mu)$ and the formula of the maximum
likelihood estimate for $\mu$.

c. What is the maximum likelihood estimate for the probability $e^{-\mu}$ of zero
arrivals?


21.5 Suppose that $x_1, x_2, \ldots, x_n$ is a dataset, which is a realization of a
random sample from a normal distribution.


a. Let the probability density of this normal distribution be given by

$$f_\mu(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-\mu)^2} \quad \text{for } -\infty < x < \infty.$$

Determine the maximum likelihood estimate for $\mu$.

b. Now suppose that the density of this normal distribution is given by

$$f_\sigma(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2} x^2/\sigma^2} \quad \text{for } -\infty < x < \infty.$$

Determine the maximum likelihood estimate for $\sigma$.


21.6 Let $x_1, x_2, \ldots, x_n$ be a dataset that is a realization of a random sample
from a distribution with probability density $f_\delta(x)$ given by

$$f_\delta(x) = \begin{cases} e^{-(x-\delta)} & \text{for } x \ge \delta \\ 0 & \text{for } x < \delta. \end{cases}$$

a. Draw the likelihood $L(\delta)$.


b. Determine the maximum likelihood estimate for $\delta$.


21.7 Suppose that $x_1, x_2, \ldots, x_n$ is a dataset, which is a realization of a
random sample from a Rayleigh distribution, which is a continuous distribution
with probability density function given by

$$f_\theta(x) = \frac{x}{\theta^2}\, e^{-\frac{1}{2} x^2/\theta^2} \quad \text{for } x \ge 0.$$

In this case what is the maximum likelihood estimate for $\theta$?


21.8 (Exercises 19.7 and 20.7 continued) A certain type of plant can be
divided into four types: starchy-green, starchy-white, sugary-green, and
sugary-white. The following table lists the counts of the various types among 3839
leaves.


Type            Count
Starchy-green    1997
Sugary-white       32
Starchy-white     906
Sugary-green      904


Setting

$$X = \begin{cases}
1 & \text{if the observed leaf is of type starchy-green} \\
2 & \text{if the observed leaf is of type sugary-white} \\
3 & \text{if the observed leaf is of type starchy-white} \\
4 & \text{if the observed leaf is of type sugary-green,}
\end{cases}$$


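the probability mass function $p$ of $X$ is given by

$$p(1) = \tfrac{1}{4}(2 + \theta), \quad p(2) = \tfrac{1}{4}\theta, \quad p(3) = \tfrac{1}{4}(1 - \theta), \quad p(4) = \tfrac{1}{4}(1 - \theta),$$

and $p(a) = 0$ for all other $a$. Here $0 < \theta < 1$ is an unknown parameter,
which was estimated in Exercise 19.7. We want to find a maximum likelihood
estimate of $\theta$.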


a. Use the data to find the likelihood $L(\theta)$ and the loglikelihood $\ell(\theta)$.

b. What is the maximum likelihood estimate of $\theta$ using the data from the
preceding table?

c. Suppose that we have the counts of $n$ different leaves: $n_1$ of type
starchy-green, $n_2$ of type sugary-white, $n_3$ of type starchy-white, and $n_4$ of type
sugary-green (so $n = n_1 + n_2 + n_3 + n_4$). Determine the general formula
for the maximum likelihood estimate of $\theta$.


21.9 Let $x_1, x_2, \ldots, x_n$ be a dataset that is a realization of a random sample
from a $U(\alpha, \beta)$ distribution (with $\alpha$ and $\beta$ unknown, $\alpha < \beta$). Determine the
maximum likelihood estimates for $\alpha$ and $\beta$.


21.10 Let $x_1, x_2, \ldots, x_n$ be a dataset, which is a realization of a random
sample from a $Par(\alpha)$ distribution. What is the maximum likelihood estimate
for $\alpha$?


21.11 In Exercise 4.13 we considered the situation where we have a box
containing an unknown number, say $N$, of identical bolts. In order to get an
idea of the size of $N$ we introduced three random variables $X$, $Y$, and $Z$. Here
we will use $X$ and $Y$, and in the next exercise $Z$, to find maximum likelihood
estimates of $N$.


a. Suppose that $x_1, x_2, \ldots, x_n$ is a dataset, which is a realization of a random
sample from a $Geo(1/N)$ distribution. Determine the maximum likelihood
estimate for $N$.

b. Suppose that $y_1, y_2, \ldots, y_n$ is a dataset, which is a realization of a random
sample from a discrete uniform distribution on $1, 2, \ldots, N$. Determine the
maximum likelihood estimate for $N$.


21.12 (Exercise 21.11 continued.) Suppose that $m$ bolts in the box were
marked and then $r$ bolts were selected from the box; $Z$ is the number of
marked bolts in the sample. (Recall that it was shown in Exercise 4.13 c that
$Z$ has a hypergeometric distribution, with parameters $m$, $N$, and $r$.) Suppose
that $k$ bolts in the sample were marked. Show that the likelihood $L(N)$ is
given by

$$L(N) = \frac{\binom{m}{k} \binom{N-m}{r-k}}{\binom{N}{r}}.$$

Next show that $L(N)$ increases for $N < mr/k$ and decreases for $N > mr/k$,
and conclude that $mr/k$ is the maximum likelihood estimate for $N$.


21.13 Often one can model the times that customers arrive at a shop rather
well by a Poisson process with (unknown) rate $\lambda$ (customers/hour). On a
certain day, one of the attendants noticed that between noon and 12.45 p.m.
two customers arrived, and another attendant noticed that on the same day
one customer arrived between 12.15 and 1 p.m. Use the observations of the
attendants to determine the maximum likelihood estimate of $\lambda$.


21.14 A very inexperienced archer shoots an arrow $n$ times at a disc of
(unknown) radius $\theta$. The disc is hit every time, but at completely random places.
Let $r_1, r_2, \ldots, r_n$ be the distances of the various hits to the center of the disc.
Determine the maximum likelihood estimate for $\theta$.


21.15 On January 28, 1986, the main fuel tank of the space shuttle Challenger
exploded shortly after takeoff. Essential in this accident was the leakage of
some of the six O-rings of the Challenger. In Section 1.4 the probability of
failure of an O-ring is given by

$$p(t) = \frac{e^{a + b \cdot t}}{1 + e^{a + b \cdot t}},$$

where $t$ is the temperature at launch in degrees Fahrenheit. In Table 21.2
the temperature $t$ (in °F, rounded to the nearest integer) and the number of
failures $N$ for 23 missions are given, ordered according to increasing
temperatures. (See also Figure 1.3, where these data are graphically depicted.) Give
the likelihood $L(a, b)$ and the loglikelihood $\ell(a, b)$.


Table 21.2. Space shuttle failure data of pre-Challenger missions.


t   53  57  58  63  66  67  67  67
N    2   1   1   1   0   0   0   0


t   68  69  70  70  70  70  72  73
N    0   0   0   0   1   1   0   0


t   75  75  76  76  78  79  81
N    0   2   0   0   0   0   0


21.16 In the 18th century Georges-Louis Leclerc, Comte de Buffon (1707-1788)
found an amusing way to approximate the number $\pi$ using probability
theory and statistics. Buffon had the following idea: take a needle and a large
sheet of paper, and draw horizontal lines that are a needle-length apart. Throw
the needle a number of times (say $n$ times) on the sheet, and count how often it
hits one of the horizontal lines. Say this number is $s_n$, then $s_n$ is the realization
of a $Bin(n, p)$ distributed random variable $S_n$. Here $p$ is the probability that
the needle hits one of the horizontal lines. In Exercise 9.20 you found that
$p = 2/\pi$. Show that

$$T = \frac{2n}{S_n}$$

is the maximum likelihood estimator for $\pi$.


22 


The method of least squares 


The maximum likelihood principle provides a way to estimate parameters. The 
applicability of the method is quite general but not universal. For example, 
in the simple linear regression model, introduced in Section 17.4, we need to 
know the distribution of the response variable in order to find the maximum 
likelihood estimates for the parameters involved. In this chapter we will see 
how these parameters can be estimated using the method of least squares. 
Furthermore, the relation between least squares and maximum likelihood will 
be investigated in the case of normally distributed errors. 


22.1 Least squares estimation and regression 


Recall from Section 17.4 the simple linear regression model for a bivariate
dataset $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. In this model $x_1, x_2, \ldots, x_n$ are
nonrandom and $y_1, y_2, \ldots, y_n$ are realizations of random variables $Y_1, Y_2, \ldots, Y_n$
satisfying

$$Y_i = \alpha + \beta x_i + U_i \quad \text{for } i = 1, 2, \ldots, n,$$

where $U_1, U_2, \ldots, U_n$ are independent random variables with zero expectation
and variance $\sigma^2$. How can one obtain estimates for the parameters $\alpha$, $\beta$, and $\sigma^2$
in this model?

Note that we cannot find maximum likelihood estimates for these parameters,
simply because we have no further knowledge about the distribution of the $U_i$
(and consequently of the $Y_i$). We want to choose $\alpha$ and $\beta$ in such a way that
we obtain a line that fits the data best. A classical approach to do this is to
consider the sum of squared distances between the observed values $y_i$ and the
values $\alpha + \beta x_i$ on the regression line $y = \alpha + \beta x$. See Figure 22.1, where these
distances are indicated. The method of least squares prescribes to choose $\alpha$
and $\beta$ such that the sum of squares

$$S(\alpha, \beta) = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2$$

is minimal. The $i$th term in the sum is the squared distance in the vertical
direction from $(x_i, y_i)$ to the line $y = \alpha + \beta x$. To find these so-called least
squares estimates, we differentiate $S(\alpha, \beta)$ with respect to $\alpha$ and $\beta$, and we
set the derivatives equal to 0:

$$\begin{aligned}
\frac{\partial}{\partial \alpha} S(\alpha, \beta) = 0 \quad &\Leftrightarrow \quad \sum_{i=1}^n (y_i - \alpha - \beta x_i) = 0 \\
\frac{\partial}{\partial \beta} S(\alpha, \beta) = 0 \quad &\Leftrightarrow \quad \sum_{i=1}^n (y_i - \alpha - \beta x_i)\, x_i = 0.
\end{aligned}$$

This is equivalent to

$$\begin{aligned}
n \alpha + \beta \sum_{i=1}^n x_i &= \sum_{i=1}^n y_i \\
\alpha \sum_{i=1}^n x_i + \beta \sum_{i=1}^n x_i^2 &= \sum_{i=1}^n x_i y_i.
\end{aligned}$$

Fig. 22.1. The observed value $y_i$ corresponding to $x_i$ and the value $\alpha + \beta x_i$ on the
regression line $y = \alpha + \beta x$.


For example, for the timber data from Table 15.5 we would obtain

$$\begin{aligned}
36\,\alpha + 1646.4\,\beta &= 52901 \\
1646.4\,\alpha + 81750.02\,\beta &= 2790525.
\end{aligned}$$

These are two equations with two unknowns $\alpha$ and $\beta$. Solving for $\alpha$ and $\beta$
yields the solutions $\hat{\alpha} = -1160.5$ and $\hat{\beta} = 57.51$. In Figure 22.2 a scatterplot of
the timber dataset, together with the estimated regression line $y = -1160.5 + 57.51x$,
is depicted.
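
The two equations for the timber data can be solved in a few lines. The sketch below takes the four sums quoted above as given and applies Cramer's rule to the 2-by-2 system; the variable names are ours.

```python
# Sums for the timber data, as quoted in the two equations above.
n = 36
sum_x = 1646.4
sum_y = 52901.0
sum_xx = 81750.02
sum_xy = 2790525.0

# Solve  n*alpha + sum_x*beta = sum_y
#        sum_x*alpha + sum_xx*beta = sum_xy
det = n * sum_xx - sum_x ** 2
beta_hat = (n * sum_xy - sum_x * sum_y) / det
alpha_hat = (sum_y - beta_hat * sum_x) / n

print(round(alpha_hat, 1), round(beta_hat, 2))
```

Up to rounding, the result reproduces the intercept and slope reported in the text.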


QUICK EXERCISE 22.1 Suppose you are given a piece of Australian timber with 
density 65. What would you choose as an estimate for the Janka hardness? 


Fig. 22.2. Scatterplot and estimated regression line for the timber data
(horizontal axis: wood density; vertical axis: hardness).


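In general, writing $\sum$ instead of $\sum_{i=1}^n$, we find the following formulas for the
estimates $\hat{\alpha}$ (the intercept) and $\hat{\beta}$ (the slope):

$$\hat{\beta} = \frac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n \sum x_i^2 - \left(\sum x_i\right)^2} \qquad (22.1)$$

$$\hat{\alpha} = \bar{y}_n - \hat{\beta} \bar{x}_n. \qquad (22.2)$$

Since $S(\alpha, \beta)$ is an elliptic paraboloid (a “vase”), it follows that $(\hat{\alpha}, \hat{\beta})$ is the
unique minimum of $S(\alpha, \beta)$ (except when all $x_i$ are equal).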


QUICK EXERCISE 22.2 Check that the line $y = \hat{\alpha} + \hat{\beta} x$ always passes through
the “center of gravity” $(\bar{x}_n, \bar{y}_n)$.


Least squares estimators are unbiased 


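We denote the least squares estimates by $\hat{\alpha}$ and $\hat{\beta}$. It is quite common to also
denote the least squares estimators by $\hat{\alpha}$ and $\hat{\beta}$:

$$\hat{\alpha} = \bar{Y}_n - \hat{\beta} \bar{x}_n, \qquad
\hat{\beta} = \frac{n \sum x_i Y_i - \left(\sum x_i\right)\left(\sum Y_i\right)}{n \sum x_i^2 - \left(\sum x_i\right)^2}.$$

In Exercise 22.12 it is shown that $\hat{\beta}$ is an unbiased estimator for $\beta$. Using this
and the fact that $E[Y_i] = \alpha + \beta x_i$ (see page 258), we find for $\hat{\alpha}$:

$$\begin{aligned}
E[\hat{\alpha}] &= E\left[\bar{Y}_n\right] - \bar{x}_n E[\hat{\beta}] = \frac{1}{n} \sum_{i=1}^n E[Y_i] - \bar{x}_n \beta \\
  &= \frac{1}{n} \sum_{i=1}^n (\alpha + \beta x_i) - \bar{x}_n \beta = \alpha + \beta \bar{x}_n - \bar{x}_n \beta \\
  &= \alpha.
\end{aligned}$$

We see that $\hat{\alpha}$ is an unbiased estimator for $\alpha$.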




An unbiased estimator for $\sigma^2$


In the simple linear regression model the assumptions imply that the random
variables $Y_i$ are independent with variance $\sigma^2$. Unfortunately, one cannot
apply the usual estimator $(1/(n-1)) \sum_{i=1}^n (Y_i - \bar{Y}_n)^2$ for the variance of the
$Y_i$ (see Section 19.4), because different $Y_i$ have different expectations. What
would be a reasonable estimator for $\sigma^2$? The following quick exercise suggests
a candidate.


QUICK EXERCISE 22.3 Let $U_1, U_2, \ldots, U_n$ be independent random variables,
each with expected value zero and variance $\sigma^2$. Show that

$$T = \frac{1}{n} \sum_{i=1}^n U_i^2$$

is an unbiased estimator for $\sigma^2$.


At first sight one might be tempted to think that the unbiased estimator $T$
from this quick exercise is a useful tool to estimate $\sigma^2$. Unfortunately, we only
observe the $x_i$ and $Y_i$, not the $U_i$. However, from the fact that $U_i = Y_i - \alpha - \beta x_i$,
it seems reasonable to try

$$\frac{1}{n} \sum_{i=1}^n (Y_i - \hat{\alpha} - \hat{\beta} x_i)^2 \qquad (22.3)$$

as an estimator for $\sigma^2$. Tedious calculations show that the expected value of
this random variable equals $\frac{n-2}{n} \sigma^2$. But then we can easily turn it into an
unbiased estimator for $\sigma^2$.


AN UNBIASED ESTIMATOR FOR $\sigma^2$. In the simple linear regression
model the random variable

$$\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^n (Y_i - \hat{\alpha} - \hat{\beta} x_i)^2$$

is an unbiased estimator for $\sigma^2$.
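
Putting the formulas for the slope, the intercept, and the variance estimate together gives a short computation. The following sketch uses a small hypothetical dataset of our own; the function names `least_squares` and `sigma2_hat` are illustrative and not taken from the text.

```python
def least_squares(xs, ys):
    """Return (alpha_hat, beta_hat) from formulas (22.1) and (22.2)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    beta = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    alpha = sy / n - beta * sx / n
    return alpha, beta

def sigma2_hat(xs, ys):
    """Estimate sigma^2 by the residual sum of squares divided by n - 2."""
    alpha, beta = least_squares(xs, ys)
    rss = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))
    return rss / (len(xs) - 2)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha_hat, beta_hat = least_squares(xs, ys)
s2 = sigma2_hat(xs, ys)
print(alpha_hat, beta_hat, s2)
```

Dividing by $n - 2$ rather than $n$ reflects that two parameters were estimated from the data before the squared deviations were computed.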


22.2 Residuals 


A way to explore whether the simple linear regression model is appropriate
to model a given bivariate dataset is to inspect a scatterplot of the so-called
residuals $r_i$ against the $x_i$. The $i$th residual $r_i$ is defined as the vertical distance
between the $i$th point and the estimated regression line:

$$r_i = y_i - \hat{\alpha} - \hat{\beta} x_i, \quad i = 1, 2, \ldots, n.$$




When a linear model is appropriate, the scatterplot of the residuals $r_i$ against
the $x_i$ should show truly random fluctuations around zero, in the sense that
it should not exhibit any trend or pattern. This seems to be the case in
Figure 22.3, which shows the residuals for the black cherry tree data from
Exercise 17.9.

Fig. 22.3. Scatterplot of $r_i$ versus $x_i$ for the black cherry tree data.
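The residuals are simple to compute once the line has been fitted. The sketch below is one possible way to do it; it uses the centered form $\hat{\beta} = \sum (x_i - \bar{x}_n)(y_i - \bar{y}_n) / \sum (x_i - \bar{x}_n)^2$, which is algebraically equivalent to the formula for $\hat{\beta}$ given earlier. The function name `residuals` and the data are hypothetical.

```python
def residuals(xs, ys):
    """Return the residuals r_i = y_i - alpha_hat - beta_hat * x_i."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta_hat = sxy / sxx          # slope (requires that not all x are equal)
    alpha_hat = y_bar - beta_hat * x_bar
    return [y - alpha_hat - beta_hat * x for x, y in zip(xs, ys)]


# Hypothetical data, not taken from the text.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.0]
rs = residuals(xs, ys)
print(rs)
print(sum(rs))  # zero up to floating-point rounding
```

Plotting `rs` against `xs` would give the kind of residual scatterplot described above.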


QUICK EXERCISE 22.4 Recall from Quick exercise 22.2 that $(\bar{x}_n, \bar{y}_n)$ is on the
regression line $y = \hat{\alpha} + \hat{\beta}x$, i.e., that $\bar{y}_n = \hat{\alpha} + \hat{\beta}\bar{x}_n$. Use this to show that
$\sum_{i=1}^{n} r_i = 0$, i.e., that the sum of the residuals is zero.


In Figure 22.4 we depicted $r_i$ versus $x_i$ for the timber dataset. In this case a
slight parabolic pattern can be observed. Figures 22.2 and 22.4 suggest that
for the timber dataset a better model might be

$$Y_i = \alpha + \beta x_i + \gamma x_i^2 + U_i \quad \text{for } i = 1, 2, \ldots, n.$$

In this new model the residuals are

$$r_i = y_i - \hat{\alpha} - \hat{\beta}x_i - \hat{\gamma}x_i^2,$$

where $\hat{\alpha}$, $\hat{\beta}$, and $\hat{\gamma}$ are the least squares estimates obtained by minimizing

$$\sum_{i=1}^{n} (y_i - \alpha - \beta x_i - \gamma x_i^2)^2.$$

Fig. 22.4. Scatterplot of $r_i$ versus $x_i$ for the timber data with the simple linear
regression model $Y_i = \alpha + \beta x_i + U_i$.


In Figure 22.5 we depicted $r_i$ versus $x_i$. The residuals display no trend or
pattern, except that they “fan out”, an example of a phenomenon called
heteroscedasticity.

Fig. 22.5. Scatterplot of $r_i$ versus $x_i$ for the timber data with the model
$Y_i = \alpha + \beta x_i + \gamma x_i^2 + U_i$.


Heteroscedasticity 


The assumption of equal variance of the $U_i$ (and therefore of the $Y_i$) is called
homoscedasticity. In case the variance of $Y_i$ depends on the value of $x_i$, we
speak of heteroscedasticity. For instance, heteroscedasticity occurs when $Y_i$
with a large expected value have a larger variance than those with small expected
values. This produces a “fanning out” effect, which can be observed
in Figure 22.5. This figure strongly suggests that the timber data are heteroscedastic.
Possible ways out of this problem are a technique called weighted
least squares or the use of variance-stabilizing transformations.




22.3 Relation with maximum likelihood 


To apply the method of least squares no assumption is needed about the type
of distribution of the $U_i$. In case the type of distribution of the $U_i$ is known,
the maximum likelihood principle can be applied. Consider, for instance, the
classical situation where the $U_i$ are independent with an $N(0, \sigma^2)$ distribution.
What are the maximum likelihood estimates for $\alpha$ and $\beta$?

In this case the $Y_i$ are independent, and $Y_i$ has an $N(\alpha + \beta x_i, \sigma^2)$ distribution.
Under these assumptions and assuming that the linear model is appropriate
to model a given bivariate dataset, the $r_i$ should look like the realization of a
random sample from a normal distribution. As an example a histogram of the
residuals $r_i$ of the cherry tree data of Exercise 17.9 is depicted in Figure 22.6.


Fig. 22.6. Histogram of the residuals $r_i$ for the black cherry tree data.


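The data do not exhibit strong evidence against the assumption of normality.
When $Y_i$ has an $N(\alpha + \beta x_i, \sigma^2)$ distribution, the probability density of $Y_i$ is
given by

$$f_i(y) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y - \alpha - \beta x_i)^2/(2\sigma^2)} \quad \text{for } -\infty < y < \infty.$$

Since

$$\ln(f_i(y_i)) = -\ln(\sigma) - \ln(\sqrt{2\pi}) - \frac{1}{2}\left(\frac{y_i - \alpha - \beta x_i}{\sigma}\right)^2,$$

the loglikelihood is:

$$\ell(\alpha, \beta, \sigma) = \ln(f_1(y_1)) + \cdots + \ln(f_n(y_n)) = -n\ln(\sigma) - n\ln(\sqrt{2\pi}) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2.$$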




Note that for any fixed $\sigma > 0$, the loglikelihood $\ell(\alpha, \beta, \sigma)$ attains its maximum
precisely when $\sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$ is minimal. Hence, in case the $U_i$ are
independent with an $N(0, \sigma^2)$ distribution, the maximum likelihood principle
and the least squares method yield the same estimators.


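To find the maximum likelihood estimate of $\sigma$ we differentiate $\ell(\alpha, \beta, \sigma)$ with
respect to $\sigma$:

$$\frac{\partial}{\partial\sigma}\ell(\alpha, \beta, \sigma) = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2.$$

It follows (from the invariance principle on page 321) that the maximum
likelihood estimator of $\sigma^2$ is given by

$$\frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{\alpha} - \hat{\beta}x_i)^2,$$

which is the estimator from (22.3).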
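The step from the derivative to the estimator, and its relation to the unbiased estimator $\hat{\sigma}^2$ of Section 22.1, can be written out as follows. The label $\hat{\sigma}^2_{\mathrm{ML}}$ is introduced here only for this comparison and is not notation used elsewhere in the text.

```latex
\frac{\partial \ell}{\partial \sigma} = 0
\;\Longleftrightarrow\;
\frac{n}{\sigma} = \frac{1}{\sigma^{3}} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2
\;\Longleftrightarrow\;
\sigma^{2} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 ,
\qquad\text{so that}\qquad
\hat{\sigma}^{2}_{\mathrm{ML}}
  = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{\alpha} - \hat{\beta} x_i)^2
  = \frac{n-2}{n}\, \hat{\sigma}^{2} .
```

Since the expected value of (22.3) equals $\frac{n-2}{n}\sigma^2$, the maximum likelihood estimator is slightly biased for small $n$, and the bias vanishes as $n$ grows.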


22.4 Solutions to the quick exercises 


22.1 We can use the estimated regression line $y = -1160.5 + 57.51x$ to predict
the Janka hardness. For density $x = 65$ we find as a prediction for the Janka
hardness $y = 2577.65$.


22.2 Rewriting $\hat{\alpha} = \bar{y}_n - \hat{\beta}\bar{x}_n$, it follows that $\bar{y}_n = \hat{\alpha} + \hat{\beta}\bar{x}_n$, which means that
$(\bar{x}_n, \bar{y}_n)$ is a point on the estimated regression line $y = \hat{\alpha} + \hat{\beta}x$.


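22.3 We need to show that $\mathrm{E}[T] = \sigma^2$. Since $\mathrm{E}[U_i] = 0$, $\mathrm{Var}(U_i) = \mathrm{E}[U_i^2]$,
so that:

$$\mathrm{E}[T] = \mathrm{E}\left[\frac{1}{n}\sum_{i=1}^{n} U_i^2\right] = \frac{1}{n}\sum_{i=1}^{n} \mathrm{E}[U_i^2] = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Var}(U_i) = \sigma^2.$$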


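22.4 Since $r_i = y_i - (\hat{\alpha} + \hat{\beta}x_i)$ for $i = 1, 2, \ldots, n$, it follows that the sum of
the residuals equals

$$\sum r_i = \sum y_i - \left(n\hat{\alpha} + \hat{\beta}\sum x_i\right) = n\bar{y}_n - (n\hat{\alpha} + n\hat{\beta}\bar{x}_n) = n\left(\bar{y}_n - (\hat{\alpha} + \hat{\beta}\bar{x}_n)\right) = 0,$$

because $\bar{y}_n = \hat{\alpha} + \hat{\beta}\bar{x}_n$, according to Quick exercise 22.2.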




22.5 Exercises 


22.1 Consider the following bivariate dataset:
(1,2) (3,1.8) (5,1).


a. Determine the least squares estimates $\hat{\alpha}$ and $\hat{\beta}$ of the parameters of the
regression line $y = \alpha + \beta x$.

b. Determine the residuals $r_1$, $r_2$, and $r_3$ and check that they add up to 0.

c. Draw in one figure the scatterplot of the data and the estimated regression
line $y = \hat{\alpha} + \hat{\beta}x$.


22.2 Adding one point may dramatically change the estimates of $\alpha$ and $\beta$.
Suppose one extra datapoint is added to the dataset of the previous exercise
and that we have as dataset:


(0,0) (1,2) (3,1.8) (5,1).


Determine the least squares estimate of $\beta$. A point such as (0,0), which dramatically
changes the estimates for $\alpha$ and $\beta$, is called a leverage point.


22.3 Suppose we have the following bivariate dataset:
(1,3.1) (1.7,3.9) (2.1,3.8) (2.5,4.7) (2.7,4.5).


a. Determine the least squares estimates $\hat{\alpha}$ and $\hat{\beta}$ of the parameters of the
regression line $y = \alpha + \beta x$. You may use that $\sum x_i = 10$, $\sum y_i = 20$,
$\sum x_i^2 = 21.84$, and $\sum x_i y_i = 41.61$.

b. Draw in one figure the scatterplot of the data and the estimated regression
line $y = \hat{\alpha} + \hat{\beta}x$.


22.4 We are given a bivariate dataset $(x_1, y_1), (x_2, y_2), \ldots, (x_{100}, y_{100})$. For
this bivariate dataset it is known that $\sum x_i = 231.7$, $\sum x_i^2 = 2400.8$, $\sum y_i = 321$,
and $\sum x_i y_i = 5189$. What are the least squares estimates $\hat{\alpha}$ and $\hat{\beta}$ of the
parameters of the regression line $y = \alpha + \beta x$?


22.5 For the timber dataset it seems reasonable to leave out the intercept $\alpha$
(“no hardness without density”). The model then becomes


$$Y_i = \beta x_i + U_i \quad \text{for } i = 1, 2, \ldots, n.$$


Show that the least squares estimator $\hat{\beta}$ of $\beta$ is now given by

$$\hat{\beta} = \frac{\sum_{i=1}^{n} x_i Y_i}{\sum_{i=1}^{n} x_i^2}$$

by minimizing the appropriate sum of squares.




22.6 (Quick exercise 22.1 and Exercise 22.5 continued). Suppose we are
given a piece of Australian timber with density 65. What would you choose
as an estimate for the Janka hardness, based on the regression model with
no intercept? Recall that $\sum x_i y_i = 2790525$ and $\sum x_i^2 = 81750.02$ (see also
Section 22.1).


22.7 Consider the dataset


$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n),$$


where $x_1, x_2, \ldots, x_n$ are nonrandom and $y_1, y_2, \ldots, y_n$ are realizations of random
variables $Y_1, Y_2, \ldots, Y_n$, satisfying


$$Y_i = e^{\alpha + \beta x_i} + U_i \quad \text{for } i = 1, 2, \ldots, n.$$


Here $U_1, U_2, \ldots, U_n$ are independent random variables with zero expectation
and variance $\sigma^2$. What are the least squares estimates for the parameters $\alpha$
and $\beta$ in this model?


22.8 Which simple regression model has the larger residual sum of squares
$\sum_{i=1}^{n} r_i^2$, the model with intercept or the one without?


22.9 For some datasets it seems reasonable to leave out the slope $\beta$. For
example, in the jury example from Section 6.3 it was assumed that the score
that juror $i$ assigns when the performance deserves a score $g$ is $Y_i = g + Z_i$,
where $Z_i$ is a random variable with values around zero. In general, when the
slope $\beta$ is left out, the model becomes


$$Y_i = \alpha + U_i \quad \text{for } i = 1, 2, \ldots, n.$$


Show that $\bar{Y}_n$ is the least squares estimator $\hat{\alpha}$ of $\alpha$.


22.10 In the method of least squares we choose $\alpha$ and $\beta$ in such a way
that the sum of squared residuals $S(\alpha, \beta)$ is minimal. Since the $i$th term in
this sum is the squared vertical distance from $(x_i, y_i)$ to the regression line
$y = \alpha + \beta x$, one might also wonder whether it is a good idea to replace this
squared distance simply by the distance. So, given a bivariate dataset


$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n),$$


choose $\alpha$ and $\beta$ in such a way that the sum

$$A(\alpha, \beta) = \sum_{i=1}^{n} |y_i - \alpha - \beta x_i|$$

is minimal. We will investigate this by a simple example. Consider the following
bivariate dataset:
(0, 2), (1, 2), (2, 0).




a. Determine the least squares estimates $\hat{\alpha}$ and $\hat{\beta}$, and draw in one figure
the scatterplot of the data and the estimated regression line $y = \hat{\alpha} + \hat{\beta}x$.
Finally, determine $A(\hat{\alpha}, \hat{\beta})$.


b. One might wonder whether $\hat{\alpha}$ and $\hat{\beta}$ also minimize $A(\alpha, \beta)$. To investigate
this, choose $\beta = -1$ and find $\alpha$'s for which $A(\alpha, -1) < A(\hat{\alpha}, \hat{\beta})$. For which
$\alpha$ is $A(\alpha, -1)$ minimal?

c. Find $\alpha$ and $\beta$ for which $A(\alpha, \beta)$ is minimal.


22.11 Consider the dataset $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where the $x_i$ are
nonrandom and the $y_i$ are realizations of random variables $Y_1, Y_2, \ldots, Y_n$ satisfying

$$Y_i = g(x_i) + U_i \quad \text{for } i = 1, 2, \ldots, n,$$

where $U_1, U_2, \ldots, U_n$ are independent random variables with zero expectation
and variance $\sigma^2$. Visual inspection of the scatterplot of our dataset in
Figure 22.7 suggests that we should model the $Y_i$ by

$$Y_i = \beta x_i + \gamma x_i^2 + U_i \quad \text{for } i = 1, 2, \ldots, n.$$

Fig. 22.7. Scatterplot of $y_i$ versus $x_i$.

a. Show that the least squares estimators $\hat{\beta}$ and $\hat{\gamma}$ satisfy

$$\beta\sum x_i^2 + \gamma\sum x_i^3 = \sum x_i y_i,$$
$$\beta\sum x_i^3 + \gamma\sum x_i^4 = \sum x_i^2 y_i.$$

b. Infer from a (for instance, by using linear algebra) that the estimators
$\hat{\beta}$ and $\hat{\gamma}$ are given by

$$\hat{\beta} = \frac{(\sum x_i Y_i)(\sum x_i^4) - (\sum x_i^3)(\sum x_i^2 Y_i)}{(\sum x_i^2)(\sum x_i^4) - (\sum x_i^3)^2}$$

and

$$\hat{\gamma} = \frac{(\sum x_i^2)(\sum x_i^2 Y_i) - (\sum x_i^3)(\sum x_i Y_i)}{(\sum x_i^2)(\sum x_i^4) - (\sum x_i^3)^2}.$$


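22.12 The least squares estimator $\hat{\beta}$ from (22.1) is an unbiased estimator
for $\beta$. You can show this in four steps.


a. First show that

$$\mathrm{E}[\hat{\beta}] = \frac{n\sum x_i \mathrm{E}[Y_i] - (\sum x_i)(\sum \mathrm{E}[Y_i])}{n\sum x_i^2 - (\sum x_i)^2}.$$

b. Next use that $\mathrm{E}[Y_i] = \alpha + \beta x_i$, to obtain that

$$\mathrm{E}[\hat{\beta}] = \frac{n\sum x_i(\alpha + \beta x_i) - (\sum x_i)(n\alpha + \beta\sum x_i)}{n\sum x_i^2 - (\sum x_i)^2}.$$

c. Simplify this last expression to find

$$\mathrm{E}[\hat{\beta}] = \frac{n\beta\sum x_i^2 - \beta(\sum x_i)^2}{n\sum x_i^2 - (\sum x_i)^2}.$$

d. Finally, conclude that $\hat{\beta}$ is an unbiased estimator for $\beta$.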


23 


Confidence intervals for the mean 


Sometimes, a range of plausible values for an unknown parameter is preferred 
to a single estimate. We shall discuss how to turn data into what are called 
confidence intervals and show that this can be done in such a manner that 
definite statements can be made about how confident we are that the true pa- 
rameter value is in the reported interval. This level of confidence is something 
you can choose. We start this chapter with the general principle of confidence 
intervals. We continue with confidence intervals for the mean, the common 
way to refer to confidence intervals made for the expected value of the model 
distribution. Depending on the situation, one of the four methods presented 
will apply. 


23.1 General principle 


In previous chapters we have encountered sample statistics as estimators for
distribution features. This started somewhat informally in Chapter 17, where
it was claimed, for example, that the sample mean and the sample variance
are usually close to $\mu$ and $\sigma^2$ of the underlying distribution. Bias and MSE
of estimators, discussed in Chapters 19 and 20, are used to judge the quality
of estimators. If we have at our disposal an estimator $T$ for an unknown
parameter $\theta$, we use its realization $t$ as our estimate for $\theta$. For example, when
collecting data on the speed of light, as Michelson did (see Section 13.1), the
unknown speed of light would be the parameter $\theta$, our estimator $T$ could
be the sample mean, and Michelson’s data then yield an estimate $t$ for $\theta$ of
299 852.4 km/sec. We call this number a point estimate: if we are required
to select one number, this is it. Had the measurements started a day earlier,
however, the whole experiment would in essence be the same, but the results
might have been different. Hence, we cannot say that the estimate equals the
speed of light but rather that it is close to the true speed of light. For example,
we could say something like: “we have great confidence that the true speed of
light is somewhere between ... and ... .” In addition to providing an interval
of plausible values for $\theta$ we would want to add a specific statement about how
confident we are that the true $\theta$ is among them.


In this chapter we shall present methods to make confidence statements about
unknown parameters, based on knowledge of the sampling distributions of corresponding
estimators. To illustrate the main idea, suppose the estimator $T$
is unbiased for the speed of light $\theta$. For the moment, also suppose that $T$
has standard deviation $\sigma_T = 100$ km/sec (we shall drop this unrealistic assumption
shortly). Then, applying formula (13.1), which was derived from
Chebyshev’s inequality (see Section 13.2), we find

$$\mathrm{P}(|T - \theta| < 2\sigma_T) \ge \tfrac{3}{4}. \qquad (23.1)$$

In words this reads: with probability at least 75%, the estimator $T$ is within
$2\sigma_T = 200$ of the true speed of light $\theta$. We could rephrase this as


$T \in (\theta - 200, \theta + 200)$ with probability at least 75%.


However, if I am near the city of Paris, then the city of Paris is near me: the
statement “$T$ is within 200 of $\theta$” is the same as “$\theta$ is within 200 of $T$,” and
we could equally well rephrase (23.1) as


$\theta \in (T - 200, T + 200)$ with probability at least 75%.


Note that of the last two equations the first is a statement about a random
variable $T$ being in a fixed interval, whereas in the second equation the interval
is random and the statement is about the probability that the random interval
covers the fixed but unknown $\theta$. The interval $(T - 200, T + 200)$ is sometimes
called an interval estimator, and its realization is an interval estimate.


Evaluating $T$ for the Michelson data we find as its realization $t = 299\,852.4$,
and this yields the statement


$$\theta \in (299\,652.4,\ 300\,052.4). \qquad (23.2)$$


Because we substituted the realization for the random variable, we cannot
claim that (23.2) holds with probability at least 75%: either the true speed of
light $\theta$ belongs to the interval or it does not; the statement we make is either
true or false, we just do not know which. However, because the procedure
guarantees a probability of at least 75% of getting a “right” statement, we
say:


$\theta \in (299\,652.4,\ 300\,052.4)$ with confidence at least 75%. (23.3)
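The construction can be sketched in a few lines of Python. The function below is only an illustration of the Chebyshev-based reasoning, assuming an unbiased estimator with known standard deviation; the name `chebyshev_interval` and the parameter `k` (the number of standard deviations) are hypothetical and do not appear in the text.

```python
def chebyshev_interval(t, sigma_t, k):
    """Conservative interval (t - k*sigma_t, t + k*sigma_t) for an unbiased
    estimator with standard deviation sigma_t.

    Chebyshev's inequality gives P(|T - theta| < k*sigma_t) >= 1 - 1/k**2,
    so the returned confidence is a lower bound, not an exact level.
    """
    if k <= 1:
        raise ValueError("k must exceed 1 for a nontrivial bound")
    confidence = 1 - 1 / k ** 2
    return (t - k * sigma_t, t + k * sigma_t), confidence


# Michelson example from the text: t = 299 852.4, sigma_T = 100 (assumed), k = 2.
interval, confidence = chebyshev_interval(299852.4, 100.0, 2)
print(interval, confidence)
```

Larger values of `k` trade a wider interval for a higher guaranteed confidence.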


The construction of this confidence interval only involved an unbiased estima- 
tor and knowledge of its standard deviation. When more information on the 
sampling distribution of the estimator is available, more refined statements 
can be made, as we shall see shortly. 




QUICK EXERCISE 23.1 Repeat the preceding derivation, starting from the
statement $\mathrm{P}(|T - \theta| < 3\sigma_T) \ge 8/9$ (check that this follows from Chebyshev’s
inequality). What is the resulting confidence interval for the speed of light,
and what is the corresponding confidence?


A general definition 


Many confidence intervals are of the form¹

$$(t - c\cdot\sigma_T,\ t + c\cdot\sigma_T)$$

we just encountered, where $c$ is a number near 2 or 3. The corresponding
confidence is often much higher than in the preceding example. Because there
are many other ways confidence intervals can (or have to) be constructed, the
general definition looks a bit different.


CONFIDENCE INTERVALS. Suppose a dataset $x_1, \ldots, x_n$ is given,
modeled as realization of random variables $X_1, \ldots, X_n$. Let $\theta$ be the
parameter of interest, and $\gamma$ a number between 0 and 1. If there exist
sample statistics $L_n = g(X_1, \ldots, X_n)$ and $U_n = h(X_1, \ldots, X_n)$ such
that

$$\mathrm{P}(L_n < \theta < U_n) = \gamma$$

for every value of $\theta$, then

$$(l_n, u_n),$$

where $l_n = g(x_1, \ldots, x_n)$ and $u_n = h(x_1, \ldots, x_n)$, is called a $100\gamma\%$
confidence interval for $\theta$. The number $\gamma$ is called the confidence level.


Sometimes sample statistics $L_n$ and $U_n$ as required in the definition do not
exist, but one can find $L_n$ and $U_n$ that satisfy


$$\mathrm{P}(L_n < \theta < U_n) \ge \gamma.$$


The resulting confidence interval $(l_n, u_n)$ is called a conservative $100\gamma\%$ confidence
interval for $\theta$: the actual confidence level might be higher. For example,
the interval in (23.2) is a conservative 75% confidence interval.


QUICK EXERCISE 23.2 Why is the interval in (23.2) a conservative 75% con- 
fidence interval? 


There is no way of knowing whether an individual confidence interval is correct,
in the sense that it indeed does cover $\theta$. The procedure guarantees that
each time we make a confidence interval we have probability $\gamma$ of covering $\theta$.
What this means in practice can easily be illustrated with an example, using
simulation:


¹ Another form is, for example, $(c_1 t, c_2 t)$.




Generate $x_1, \ldots, x_{20}$ from an $N(0,1)$ distribution. Next, pretend that
it is known that the data are from a normal distribution but that both
$\mu$ and $\sigma$ are unknown. Construct the 90% confidence interval for the
expectation $\mu$ using the method described in the next section, which
says to use $(l_n, u_n)$ with

$$l_n = \bar{x}_{20} - 1.729\,\frac{s_{20}}{\sqrt{20}}, \qquad u_n = \bar{x}_{20} + 1.729\,\frac{s_{20}}{\sqrt{20}},$$

where $\bar{x}_{20}$ and $s_{20}$ are the sample mean and standard deviation. Finally,
check whether the “true $\mu$,” in this case 0, is in the confidence
interval.


We repeated the whole procedure 50 times, making 50 confidence intervals
for $\mu$. Each confidence interval is based on a fresh independently generated
set of data. The 50 intervals are plotted in Figure 23.1 as horizontal line
segments, and at $\mu$ (0!) a vertical line is drawn. We count 46 “hits”: only four
intervals do not contain the true $\mu$.

Fig. 23.1. Fifty 90% confidence intervals for $\mu = 0$.
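The simulation described above can be sketched with Python's standard library. This is not the code used to produce Figure 23.1; the function names `t_interval` and `count_hits`, the seed, and the hard-coded critical value 1.729 (the value of $t_{19,0.05}$ for a 90% interval with $n = 20$) are assumptions made for this illustration, so the number of hits will generally differ from the 46 reported in the text.

```python
import math
import random
import statistics

T_CRIT = 1.729  # assumed critical value t_{19,0.05} for n = 20 and 90% confidence


def t_interval(sample, t_crit):
    """Interval (mean - t*s/sqrt(n), mean + t*s/sqrt(n)) for one sample."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divisor n - 1)
    half_width = t_crit * s / math.sqrt(n)
    return mean - half_width, mean + half_width


def count_hits(repetitions, n=20, mu=0.0, sigma=1.0, seed=1):
    """Count how many simulated intervals cover the true expectation mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = t_interval(sample, T_CRIT)
        if lower < mu < upper:
            hits += 1
    return hits


print(count_hits(50))  # number of "hits" among 50 intervals
```

With many more repetitions the fraction of hits should settle near 0.90, which is what the confidence level promises about the procedure rather than about any single interval.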


QUICK EXERCISE 23.3 Suppose you were to make 40 confidence intervals with 
confidence level 95%. About how many of them should you expect to be 
“wrong”? Should you be surprised if 10 of them are wrong? 


In the remainder of this chapter we consider confidence intervals for the mean:
confidence intervals for the unknown expectation $\mu$ of the distribution from
which the sample originates. We start with the situation where it is known that
the data originate from a normal distribution, first with known variance, then
with unknown variance. Then we drop the normal assumption, first use the
bootstrap, and finally show how, for very large samples, confidence intervals
based on the central limit theorem are made.


23.2 Normal data 


Suppose the data can be seen as the realization of a sample $X_1, \ldots, X_n$ from
an $N(\mu, \sigma^2)$ distribution and $\mu$ is the (unknown) parameter of interest. If the
variance $\sigma^2$ is known, confidence intervals are easily derived. Before we do
this, some preparation has to be done.


Critical values 


We shall need so-called critical values for the standard normal distribution.
The critical value $z_p$ of an $N(0,1)$ distribution is the number that has right
tail probability $p$. It is defined by


$$\mathrm{P}(Z \ge z_p) = p,$$


where $Z$ is an $N(0,1)$ random variable. For example, from Table B.1 we read
$\mathrm{P}(Z \ge 1.96) = 0.025$, so $z_{0.025} = 1.96$. In fact, $z_p$ is the $(1-p)$th quantile of
the standard normal distribution:


$$\Phi(z_p) = \mathrm{P}(Z \le z_p) = 1 - p.$$


By the symmetry of the standard normal density, $\mathrm{P}(Z \le -z_p) = \mathrm{P}(Z \ge z_p) = p$,
so $\mathrm{P}(Z \ge -z_p) = 1 - p$ and therefore

$$z_{1-p} = -z_p.$$

For example, $z_{0.975} = -z_{0.025} = -1.96$. All this is illustrated in Figure 23.2.
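When no table is at hand, critical values can be obtained from the quantile function, since $z_p$ is the $(1-p)$th quantile. The sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8); the helper name `z_critical` is hypothetical.

```python
from statistics import NormalDist


def z_critical(p):
    """Critical value z_p of N(0,1): the number with right tail probability p."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    # z_p is the (1 - p)th quantile of the standard normal distribution.
    return NormalDist().inv_cdf(1 - p)


print(round(z_critical(0.025), 2))   # compare with Table B.1
print(round(z_critical(0.975), 2))   # illustrates z_{1-p} = -z_p
```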


QUICK EXERCISE 23.4 Determine $z_{0.01}$ and $z_{0.95}$ from Table B.1.




Fig. 23.2. Critical values of the standard normal distribution. 


Variance known 


If $X_1, \ldots, X_n$ is a random sample from an $N(\mu, \sigma^2)$ distribution, then $\bar{X}_n$ has
an $N(\mu, \sigma^2/n)$ distribution, and from the properties of the normal distribution
(see page 106), we know that

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \quad \text{has an } N(0,1) \text{ distribution.}$$

If $c_l$ and $c_u$ are chosen such that $\mathrm{P}(c_l < Z < c_u) = \gamma$ for an $N(0,1)$ distributed
random variable $Z$, then

$$\gamma = \mathrm{P}\left(c_l < \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} < c_u\right) = \mathrm{P}\left(c_l\frac{\sigma}{\sqrt{n}} < \bar{X}_n - \mu < c_u\frac{\sigma}{\sqrt{n}}\right) = \mathrm{P}\left(\bar{X}_n - c_u\frac{\sigma}{\sqrt{n}} < \mu < \bar{X}_n - c_l\frac{\sigma}{\sqrt{n}}\right).$$

We have found that

$$L_n = \bar{X}_n - c_u\frac{\sigma}{\sqrt{n}} \quad \text{and} \quad U_n = \bar{X}_n - c_l\frac{\sigma}{\sqrt{n}}$$

satisfy the confidence interval definition: the interval $(L_n, U_n)$ covers $\mu$ with
probability $\gamma$. Therefore

$$\left(\bar{x}_n - c_u\frac{\sigma}{\sqrt{n}},\ \bar{x}_n - c_l\frac{\sigma}{\sqrt{n}}\right)$$

is a $100\gamma\%$ confidence interval for $\mu$. A common choice is to divide $\alpha = 1 - \gamma$
evenly between the tails,² that is, solve $c_l$ and $c_u$ from

$$\mathrm{P}(Z \ge c_u) = \alpha/2 \quad \text{and} \quad \mathrm{P}(Z \le c_l) = \alpha/2,$$

so that $c_u = z_{\alpha/2}$ and $c_l = z_{1-\alpha/2} = -z_{\alpha/2}$. Summarizing, the $100(1-\alpha)\%$
confidence interval for $\mu$ is:

$$\left(\bar{x}_n - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x}_n + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right).$$

For example, if $\alpha = 0.05$, we use $z_{0.025} = 1.96$ and the 95% confidence interval
is

$$\left(\bar{x}_n - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x}_n + 1.96\frac{\sigma}{\sqrt{n}}\right).$$

² Here this choice could be motivated by the fact that it leads to the shortest
confidence interval; in other examples the shortest interval requires an asymmetric
division of $\alpha$. If you are only concerned with the left or right boundary of the
confidence interval, see the next chapter.


Example: gross calorific content of coal


When a shipment of coal is traded, a number of its properties should be known
accurately, because the value of the shipment is determined by them. An important
example is the so-called gross calorific value, which characterizes the
heat content and is a numerical value in megajoules per kilogram (MJ/kg).
The International Organization of Standardization (ISO) issues standard procedures
for the determination of these properties. For the gross calorific value,
there is a method known as ISO 1928. When the procedure is carried out properly,
resulting measurement errors are known to be approximately normal,
with a standard deviation of about 0.1 MJ/kg. Laboratories that operate
according to standard procedures receive ISO certificates. In Table 23.1, a
number of such ISO 1928 measurements is given for a shipment of Osterfeld
coal coded 262DE27.


Table 23.1. Gross calorific value measurements for Osterfeld 262DE27.


23.870 23.730 23.712 23.760 23.640 23.850 23.840 23.860
23.940 23.830 23.877 23.700 23.796 23.727 23.778 23.740
23.890 23.780 23.678 23.771 23.860 23.690 23.800


Source: A.M.H. van der Veen and A.J.M. Broos. Interlaboratory study programme
“ILS coal characterization”, reported data. Technical report, NMi
Van Swinden Laboratorium B.V., The Netherlands, 1996.


We want to combine these values into a confidence statement about the “true”
gross calorific content of Osterfeld 262DE27. From the data, we compute $\bar{x}_n = 23.788$.
Using the given $\sigma = 0.1$ and $\alpha = 0.05$, we find the 95% confidence
interval

$$\left(23.788 - 1.96\,\frac{0.1}{\sqrt{23}},\ 23.788 + 1.96\,\frac{0.1}{\sqrt{23}}\right) = (23.747,\ 23.829) \text{ MJ/kg.}$$
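The Osterfeld interval can be reproduced from the measurements in Table 23.1. The sketch below is one way to do so with the Python standard library; the function name `z_interval` is hypothetical, and the critical value is taken from the normal quantile function rather than from Table B.1, which does not change the result at three decimals.

```python
import math
from statistics import NormalDist, mean

# Gross calorific value measurements for Osterfeld 262DE27 (Table 23.1).
osterfeld = [
    23.870, 23.730, 23.712, 23.760, 23.640, 23.850, 23.840, 23.860,
    23.940, 23.830, 23.877, 23.700, 23.796, 23.727, 23.778, 23.740,
    23.890, 23.780, 23.678, 23.771, 23.860, 23.690, 23.800,
]


def z_interval(data, sigma, alpha):
    """100(1 - alpha)% confidence interval for the mean, variance known."""
    n = len(data)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value z_{alpha/2}
    half_width = z * sigma / math.sqrt(n)
    x_bar = mean(data)
    return x_bar - half_width, x_bar + half_width


lower, upper = z_interval(osterfeld, sigma=0.1, alpha=0.05)
print(round(lower, 3), round(upper, 3))
```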




Variance unknown 
When $\sigma$ is unknown, the fact that

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}$$

has a standard normal distribution has become useless, as it involves this unknown
$\sigma$, which would subsequently appear in the confidence interval. However,
if we substitute the estimator $S_n$ for $\sigma$, the resulting random variable

$$\frac{\bar{X}_n - \mu}{S_n/\sqrt{n}}$$

has a distribution that only depends on $n$ and not on $\mu$ or $\sigma$. Moreover, its
density can be given explicitly.


DEFINITION. A continuous random variable has a $t$-distribution with
parameter $m$, where $m \ge 1$ is an integer, if its probability density is
given by

$$f(x) = k_m\left(1 + \frac{x^2}{m}\right)^{-\frac{m+1}{2}} \quad \text{for } -\infty < x < \infty,$$

where $k_m = \Gamma\left(\frac{m+1}{2}\right) / \left(\Gamma\left(\frac{m}{2}\right)\sqrt{m\pi}\right)$. This distribution is denoted
by $t(m)$ and is referred to as the $t$-distribution with $m$ degrees of
freedom.


The normalizing constant $k_m$ is given in terms of the gamma function, which
was defined on page 157. For $m = 1$, it evaluates to $k_1 = 1/\pi$, and the resulting
density is that of the standard Cauchy distribution (see page 161). If $X$ has
a $t(m)$ distribution, then $\mathrm{E}[X] = 0$ for $m \ge 2$ and $\mathrm{Var}(X) = m/(m-2)$
for $m \ge 3$. Densities of $t$-distributions look like that of the standard normal
distribution: they are also symmetric around 0 and bell-shaped. As $m$ goes
to infinity the limit of the $t(m)$ density is the standard normal density. The
distinguishing feature is that densities of $t$-distributions have heavier tails:
$f(x)$ goes to zero as $x$ goes to $+\infty$ or $-\infty$, but more slowly than the density
$\phi(x)$ of the standard normal distribution. These properties are illustrated in
Figure 23.3, which shows the densities and distribution functions of the $t(1)$,
$t(2)$, and $t(5)$ distribution as well as those of the standard normal.
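The density formula can be checked numerically. The following sketch evaluates $f(x)$ for a given $m$; it works with logarithms of the gamma function (`math.lgamma`) because the gamma function itself overflows for large arguments. The function name `t_density` is hypothetical.

```python
import math


def t_density(x, m):
    """Density of the t(m) distribution at x, computed on the log scale."""
    log_k = (math.lgamma((m + 1) / 2)
             - math.lgamma(m / 2)
             - 0.5 * math.log(m * math.pi))       # log of k_m
    return math.exp(log_k - (m + 1) / 2 * math.log1p(x * x / m))


def normal_density(x):
    """Standard normal density phi(x), for comparison."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


print(t_density(0.0, 1), 1 / math.pi)             # k_1 = 1/pi (standard Cauchy)
print(t_density(4.0, 5), normal_density(4.0))     # heavier tail than the normal
print(t_density(0.0, 1000), normal_density(0.0))  # close to normal for large m
```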


We will also need critical values for the $t(m)$ distribution: the critical value
$t_{m,p}$ is the number satisfying


$$\mathrm{P}(T \ge t_{m,p}) = p,$$


where $T$ is a $t(m)$ distributed random variable. Because the $t$-distribution is
symmetric around zero, using the same reasoning as for the critical values of
the standard normal distribution, we find:

$$t_{m,1-p} = -t_{m,p}.$$

For example, in Table B.2 we read $t_{10,0.01} = 2.764$, and from this we deduce
that $t_{10,0.99} = -2.764$.


Fig. 23.3. Three $t$-distributions and the standard normal distribution. The dotted
line corresponds to the standard normal. The other distributions depicted are the
$t(1)$, $t(2)$, and $t(5)$, which in that order resemble the standard normal more and
more.


QUICK EXERCISE 23.5 Determine $t_{3,0.01}$ and $t_{35,0.9975}$ from Table B.2.


We now return to the distribution of

$$\frac{\bar{X}_n - \mu}{S_n/\sqrt{n}}$$

and construct a confidence interval for $\mu$.


THE STUDENTIZED MEAN OF A NORMAL RANDOM SAMPLE. For a
random sample $X_1, \ldots, X_n$ from an $N(\mu, \sigma^2)$ distribution, the studentized
mean

$$\frac{\bar{X}_n - \mu}{S_n/\sqrt{n}}$$

has a $t(n-1)$ distribution, regardless of the values of $\mu$ and $\sigma$.


From this fact and using critical values of the $t$-distribution, we derive that

$$\mathrm{P}\left(-t_{n-1,\alpha/2} < \frac{\bar{X}_n - \mu}{S_n/\sqrt{n}} < t_{n-1,\alpha/2}\right) = 1 - \alpha, \qquad (23.4)$$

and in the same way as when $\sigma$ is known it now follows that a $100(1-\alpha)\%$
confidence interval for $\mu$ is given by:

$$\left(\bar{x}_n - t_{n-1,\alpha/2}\frac{s_n}{\sqrt{n}},\ \bar{x}_n + t_{n-1,\alpha/2}\frac{s_n}{\sqrt{n}}\right).$$


Returning to the coal example, there was another shipment, of Daw Mill
258GB41 coal, where there were actually some doubts whether the stated
accuracy of the ISO 1928 method was attained. We therefore prefer to consider
$\sigma$ unknown and estimate it from the data, which are given in Table 23.2.


Table 23.2. Gross calorific value measurements for Daw Mill 258GB41. 


30.990 31.030 31.060 30.921 30.920 30.990 31.024 30.929 
31.050 30.991 31.208 30.830 31.330 30.810 31.060 30.800 
31.091 31.170 31.026 31.020 30.880 31.125 


Source: A.M.H. van der Veen and A.J.M. Broos. Interlaboratory study pro- 
gramme “ILS coal characterization” —reported data. Technical report, NMi 
Van Swinden Laboratorium B.V., The Netherlands, 1996. 


Doing this, we find $\bar{x}_n = 31.012$ and $s_n = 0.1294$. Because $n = 22$, for a 95%
confidence interval we use $t_{21,0.025} = 2.080$ and obtain

$$\left(31.012 - 2.080\,\frac{0.1294}{\sqrt{22}},\ 31.012 + 2.080\,\frac{0.1294}{\sqrt{22}}\right) = (30.954,\ 31.069).$$
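The Daw Mill computation can be retraced from the data in Table 23.2. The sketch below uses the Python standard library; the critical value 2.080 is copied from the text (Table B.2) because the standard library has no $t$ quantile function, and the function name `t_confidence_interval` is hypothetical.

```python
import math
import statistics

# Gross calorific value measurements for Daw Mill 258GB41 (Table 23.2).
daw_mill = [
    30.990, 31.030, 31.060, 30.921, 30.920, 30.990, 31.024, 30.929,
    31.050, 30.991, 31.208, 30.830, 31.330, 30.810, 31.060, 30.800,
    31.091, 31.170, 31.026, 31.020, 30.880, 31.125,
]

T_21_0025 = 2.080  # critical value t_{21,0.025}, as quoted in the text


def t_confidence_interval(data, t_crit):
    """Interval x_bar -/+ t_crit * s_n / sqrt(n) for normal data, sigma unknown."""
    n = len(data)
    x_bar = statistics.mean(data)
    s_n = statistics.stdev(data)   # sample standard deviation (divisor n - 1)
    half_width = t_crit * s_n / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


lower, upper = t_confidence_interval(daw_mill, T_21_0025)
print(round(statistics.mean(daw_mill), 3), round(statistics.stdev(daw_mill), 4))
print(round(lower, 3), round(upper, 3))
```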


Note that this confidence interval is (about 40%!) wider than the one we made for
the Osterfeld coal, with almost the same sample size. There are two reasons 
for this; one is that $\sigma = 0.1$ is replaced by the (larger) estimate $s_n = 0.1294$,
and the second is that the critical value $z_{0.025} = 1.96$ is replaced by the larger
$t_{21,0.025} = 2.080$. The differences in the method and the ingredients seem
minor, but they matter, especially for small samples. 


23.3 Bootstrap confidence intervals 


It is not uncommon that the methods of the previous section are used even 
when the normal distribution is not a good model for the data. In some cases 
this is not a big problem: with small deviations from normality the actual 
confidence level of a constructed confidence interval may deviate only a few 
percent from the intended confidence level. For large datasets the central limit 
theorem in fact ensures that this method provides confidence intervals with 
approximately correct confidence levels, as we shall see in the next section. 


If we doubt the normality of the data and we do not have a large sample, usually
the best thing to do is to bootstrap. Suppose we have a dataset $x_1, \ldots, x_n$,
modeled as a realization of a random sample from some distribution $F$, and
we want to construct a confidence interval for its (unknown) expectation $\mu$.




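In the previous section we saw that it suffices to find numbers $c_l$ and $c_u$ such
that

$$\mathrm{P}\left(c_l < \frac{\bar{X}_n - \mu}{S_n/\sqrt{n}} < c_u\right) = 1 - \alpha.$$

The $100(1-\alpha)\%$ confidence interval would then be

$$\left(\bar{x}_n - c_u\frac{s_n}{\sqrt{n}},\ \bar{x}_n - c_l\frac{s_n}{\sqrt{n}}\right),$$

where, of course, $\bar{x}_n$ and $s_n$ are the sample mean and the sample standard
deviation. To find $c_l$ and $c_u$, we need to know the distribution of the studentized
mean

$$T = \frac{\bar{X}_n - \mu}{S_n/\sqrt{n}}.$$

We apply the bootstrap principle. From the data $x_1, \ldots, x_n$ we determine an
estimate $\hat{F}$ of $F$. Let $X_1^*, \ldots, X_n^*$ be a random sample from $\hat{F}$, with $\mu^* = \mathrm{E}[X_i^*]$,
and consider

$$T^* = \frac{\bar{X}_n^* - \mu^*}{S_n^*/\sqrt{n}}.$$

The distribution of $T^*$ is now used as an approximation to the distribution
of $T$. If we use $\hat{F} = F_n$, we get the following.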


T* = 


EMPIRICAL BOOTSTRAP SIMULATION FOR THE STUDENTIZED MEAN.
Given a dataset $x_1, x_2, \ldots, x_n$, determine its empirical distribution
function $F_n$ as an estimate of $F$. The expectation corresponding
to $F_n$ is $\mu^* = \bar{x}_n$.

1. Generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_n^*$ from $F_n$.

2. Compute the studentized mean for the bootstrap dataset:

$$t^* = \frac{\bar{x}_n^* - \bar{x}_n}{s_n^*/\sqrt{n}},$$

where $\bar{x}_n^*$ and $s_n^*$ are the sample mean and sample standard deviation
of $x_1^*, x_2^*, \ldots, x_n^*$.

Repeat steps 1 and 2 many times.


From the bootstrap experiment we can determine $c_l^*$ and $c_u^*$, such that

$$\mathrm{P}\left(c_l^* < \frac{\bar{X}_n^* - \mu^*}{S_n^*/\sqrt{n}} < c_u^*\right) \approx 1 - \alpha.$$

By the bootstrap principle we may transfer this statement about the distribution
of $T^*$ to the distribution of $T$. That is, we may use these estimated
critical values as bootstrap approximations to $c_l$ and $c_u$:

$$c_l \approx c_l^* \quad \text{and} \quad c_u \approx c_u^*.$$

Therefore, we call

$$\left(\bar{x}_n - c_u^*\frac{s_n}{\sqrt{n}},\ \bar{x}_n - c_l^*\frac{s_n}{\sqrt{n}}\right)$$

a $100(1-\alpha)\%$ bootstrap confidence interval for $\mu$.
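The empirical bootstrap procedure for the studentized mean can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `bootstrap_t_interval`, the seed, the rule for skipping degenerate resamples with zero standard deviation, and the small dataset are all hypothetical. The critical values are read off as order statistics, in the same simple way as in the software example that follows (for 1000 repetitions and a 90% interval, the 50th and 951st order statistics).

```python
import math
import random
import statistics


def bootstrap_t_interval(data, alpha=0.10, repetitions=1000, seed=0):
    """Bootstrap confidence interval for the mean based on the studentized mean."""
    rng = random.Random(seed)
    n = len(data)
    x_bar = statistics.mean(data)
    s_n = statistics.stdev(data)

    t_stars = []
    while len(t_stars) < repetitions:
        # Step 1: draw n values with replacement, i.e. a sample from F_n.
        resample = rng.choices(data, k=n)
        s_star = statistics.stdev(resample)
        if s_star == 0:
            continue  # assumption: skip resamples in which all values coincide
        # Step 2: studentized mean of the bootstrap dataset (mu* = x_bar).
        t_stars.append((statistics.mean(resample) - x_bar) / (s_star / math.sqrt(n)))

    t_stars.sort()
    k = max(1, round(repetitions * alpha / 2))
    c_l_star = t_stars[k - 1]              # k-th order statistic
    c_u_star = t_stars[repetitions - k]    # (repetitions - k + 1)-th order statistic
    se = s_n / math.sqrt(n)
    return x_bar - c_u_star * se, x_bar - c_l_star * se


# Hypothetical right-skewed data, not taken from the text.
data = [0.3, 0.7, 1.1, 1.2, 1.9, 2.4, 3.0, 4.8, 7.5, 12.6]
print(bootstrap_t_interval(data))
```

Note the reversal in the last line of the function: the upper critical value $c_u^*$ determines the lower endpoint of the interval and vice versa.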


Example: the software data 


Recall the software data, a dataset of interfailure times (see Section 17.3). 
From the nature of the data—failure times are positive numbers—and the 
histogram (Figure 17.5), we know that they should not be modeled as a realization
of a random sample from a normal distribution. From the data we know
$\bar{x}_n = 656.88$, $s_n = 1037.3$, and $n = 135$. We generate one thousand bootstrap
datasets, and for each dataset we compute $t^*$ as in step 2 of the procedure. The
histogram and empirical distribution function made from these one thousand 
values are estimates of the density and the distribution function, respectively, 
of the bootstrap sample statistic T*; see Figure 23.4. 


Fig. 23.4. Histogram and empirical distribution function of the studentized bootstrap
simulation results for the software data.


We want to make a 90% bootstrap confidence interval, so we need $c_l^*$ and $c_u^*$,
or the 0.05th and 0.95th quantile from the empirical distribution function in
Figure 23.4. The 50th order statistic of the one thousand $t^*$ values is $-2.107$.
This means that 50 out of the one thousand values, or 5%, are smaller than
or equal to this value, and so $c_l^* = -2.107$. Similarly, from the 951st order
statistic, 1.389, we obtain³ $c_u^* = 1.389$. Inserting these values, we find the
following 90% bootstrap confidence interval for $\mu$:

$$\left(656.88 - 1.389\,\frac{1037.3}{\sqrt{135}},\ 656.88 - (-2.107)\,\frac{1037.3}{\sqrt{135}}\right) = (532.9,\ 845.0).$$

³ These results deviate slightly from the definition of empirical quantiles as given
in Section 16.3. That method is a little more accurate.


QUICK EXERCISE 23.6 The 25th and 976th order statistic from the preceding
bootstrap results are $-2.443$ and $1.713$, respectively. Use these numbers to
construct a confidence interval for $\mu$. What is the corresponding confidence
level?


Why the bootstrap may be better 


The reason to use the bootstrap is that it should lead to a more accurate 
approximation of the distribution of the studentized mean than the t(n — 1) 
distribution that follows from assuming normality. If, in the previous example, 
we would think we had normal data, we would use critical values from the 
t(134) distribution: $t_{134,0.05} = 1.656$. The result would be

$$\left(656.88 - 1.656\,\frac{1037.3}{\sqrt{135}},\ 656.88 + 1.656\,\frac{1037.3}{\sqrt{135}}\right) = (509.0,\ 804.7).$$

Comparing the intervals, we see that here the bootstrap interval is a little
wider and, as opposed to the t-interval, not centered around the sample mean
but skewed to the right side. This is one of the features of the bootstrap:
if the distribution from which the data originate is skewed, this is reflected
in the confidence interval. Looking at the histogram of the software data
(Figure 17.5), we see that it is skewed to the right: it has a long tail on the
right, but not on the left, so the same most likely holds for the distribution
from which these data originate. The skewness is reflected in the confidence
interval, which extends more to the right of $\bar{x}_n$ than to the left. In some sense,
the bootstrap adapts to the shape of the distribution, and in this way it leads
to more accurate confidence statements than using the method for normal
data. What we mean by this is that, for example, with the normal method
only 90% of the 95% confidence statements would actually cover the true
value, whereas for the bootstrap intervals this percentage would be close(r)
to 95%.


23.4 Large samples 


A variant of the central limit theorem states that as n goes to infinity, the 
distribution of the studentized mean 


$$\frac{\bar{X}_n - \mu}{S_n/\sqrt{n}}$$

approaches the standard normal distribution. This fact is the basis for so-called
large sample confidence intervals. Suppose $X_1, \ldots, X_n$ is a random
sample from some distribution $F$ with expectation $\mu$. If $n$ is large enough,
we may use

$$P\left(-z_{\alpha/2} < \frac{\bar{X}_n - \mu}{S_n/\sqrt{n}} < z_{\alpha/2}\right) \approx 1 - \alpha. \qquad (23.5)$$

This implies that if $x_1, \ldots, x_n$ can be seen as a realization of a random sample
from some unknown distribution with expectation $\mu$ and if $n$ is large enough,
then

$$\left(\bar{x}_n - z_{\alpha/2}\,\frac{s_n}{\sqrt{n}},\ \bar{x}_n + z_{\alpha/2}\,\frac{s_n}{\sqrt{n}}\right)$$

is an approximate $100(1-\alpha)\%$ confidence interval for $\mu$.


Just as earlier with the central limit theorem, a key question is “how big 
should n be?” Again, there is no easy answer. To give you some idea, we have 
listed in Table 23.3 the results of a small simulation experiment. For each of 
the distributions, sample sizes, and confidence levels listed, we constructed 
10000 confidence intervals with the large sample method; the numbers listed 
in the table are the confidence levels as estimated from the simulation, the 
coverage probabilities. The chosen Pareto distribution is very skewed, and this 
shows; the coverage probabilities for the exponential are just a few percent 
off. 


Table 23.3. Estimated coverage probabilities for large sample confidence intervals 
for non-normal data. 


Distribution    n      γ = 0.900   γ = 0.950
Exp(1)          20     0.851       0.899
Exp(1)          100    0.890       0.938
Par(2.1)        20     0.727       0.774
Par(2.1)        100    0.798       0.849
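A simulation experiment of this kind can be sketched as follows. The code is a minimal illustration, not the program used to produce Table 23.3: the function names and seed are assumptions, and because the random numbers differ, the estimates will only be close to the tabulated values, not identical. It relies on Python's `random.paretovariate`, which draws from a Pareto distribution on $[1, \infty)$, so Par(2.1) has expectation $2.1/1.1$.

```python
import math
import random
from statistics import NormalDist


def large_sample_interval(xs, level):
    """Large sample confidence interval for the expectation."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # z_{alpha/2}
    half = z * sd / math.sqrt(n)
    return mean - half, mean + half


def estimated_coverage(draw, true_mean, n, level, reps=10000, seed=0):
    """Fraction of simulated intervals that contain the true expectation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [draw(rng) for _ in range(n)]
        low, high = large_sample_interval(xs, level)
        hits += low < true_mean < high
    return hits / reps


def draw_exp(rng):
    return rng.expovariate(1.0)       # Exp(1), expectation 1


def draw_pareto(rng):
    return rng.paretovariate(2.1)     # Par(2.1), expectation 2.1 / 1.1


if __name__ == "__main__":
    for name, draw, mu in [("Exp(1)", draw_exp, 1.0),
                           ("Par(2.1)", draw_pareto, 2.1 / 1.1)]:
        for n in (20, 100):
            row = [estimated_coverage(draw, mu, n, g) for g in (0.90, 0.95)]
            print(name, n, row)
```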


In the case of simulation one can often quite easily generate a very large 
number of independent repetitions, and then this question poses no problem. 
In other cases there may be nothing better to do than hope that the dataset 
is large enough. We give an example where (we believe!) this is definitely the 
case. 


In an article published in 1910 ([28]), Rutherford and Geiger reported their 
observations on the radioactive decay of the element polonium. Using a small 
disk coated with polonium they counted the number of emitted alpha-particles 
during 2608 intervals of 7.5 seconds each. The dataset consists of the counted 
number of alpha-particles for each of the 2608 intervals and can be summarized 
as in Table 23.4. 




Table 23.4. Alpha-particle counts for 2608 intervals of 7.5 seconds. 


Count         0     1     2     3     4
Frequency    57   203   383   525   532

Count         5     6     7     8     9
Frequency   408   273   139    45    27

Count        10    11    12    13    14
Frequency    10     4     0     1     1


Source: E. Rutherford and H. Geiger (with a note by H. Bateman), The probability
variations in the distribution of α particles, Phil. Mag., 6: 698-704, 1910;
the table on page 701. 


The total number of counted alpha-particles is 10097; the average number
per interval is therefore 3.8715. The sample standard deviation can also
be computed from the table; it is 1.9225. So we know of the actual data
$x_1, x_2, \ldots, x_{2608}$ (where the counts $x_i$ are between 0 and 14) that $\bar{x}_n = 3.8715$
and $s_n = 1.9225$. We construct a 98% confidence interval for the expected
number of particles per interval. As $z_{0.01} = 2.33$ this results in

$$\left(3.8715 - 2.33\,\frac{1.9225}{\sqrt{2608}},\ 3.8715 + 2.33\,\frac{1.9225}{\sqrt{2608}}\right) = (3.784,\ 3.959).$$
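The summary statistics and the interval can be recomputed directly from the frequencies in Table 23.4. The sketch below is only a check of the arithmetic; the variable names are illustrative, and the critical value is obtained from `statistics.NormalDist` instead of Table B.1, so it is about 2.326 where the text rounds to 2.33.

```python
import math
from statistics import NormalDist

# Frequencies of the counts 0, 1, ..., 14 from Table 23.4.
frequencies = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 0, 1, 1]

n = sum(frequencies)                                   # number of intervals
total = sum(k * f for k, f in enumerate(frequencies))  # counted particles
mean = total / n
variance = sum(f * (k - mean) ** 2 for k, f in enumerate(frequencies)) / (n - 1)
sd = math.sqrt(variance)

z = NormalDist().inv_cdf(0.99)                         # z_{0.01}
half_width = z * sd / math.sqrt(n)
interval = (mean - half_width, mean + half_width)
```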


23.5 Solutions to the quick exercises 


23.1 From the probability statement, we derive, using $\sigma_T = 100$ and $8/9 = 0.889$:

$$\theta \in (T - 300,\ T + 300) \quad\text{with probability at least 88\%.}$$

With $t = 299\,852.4$, this becomes

$$\theta \in (299\,552.4,\ 300\,152.4) \quad\text{with confidence at least 88\%.}$$


23.2 Chebyshev’s inequality only gives an upper bound on the probability of a
deviation of at least $2\sigma_T$, so the actual value of $P(|T - \theta| < 2\sigma_T)$
could be higher than 3/4, depending on the distribution
of $T$. For example, in Quick exercise 13.2 we saw that in case of an exponential
distribution this probability is 0.865. For other distributions, even higher
values are attained; see Exercise 13.1.


23.3 For each of the confidence intervals we have a 5% probability that
it is wrong. Therefore, the number of wrong confidence intervals has a
Bin(40, 0.05) distribution, and we would expect about $40 \cdot 0.05 = 2$ to be
wrong. The standard deviation of this distribution is $\sqrt{40 \cdot 0.05 \cdot 0.95} = 1.38$.
The outcome “10 confidence intervals wrong” is $(10 - 2)/1.38 = 5.8$ standard
deviations from the expectation and would be a surprising outcome indeed.
(The probability of 10 or more wrong is 0.00002.)
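The numbers in this solution are easy to verify by summing binomial probabilities. The helper name below is hypothetical; the computation itself uses only `math.comb` from the standard library.

```python
import math


def binomial_upper_tail(n, p, k):
    """P(X >= k) for X with a Bin(n, p) distribution."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j)
               for j in range(k, n + 1))


n, p = 40, 0.05
expected_wrong = n * p                    # expectation of Bin(40, 0.05)
sd_wrong = math.sqrt(n * p * (1 - p))     # standard deviation
tail = binomial_upper_tail(n, p, 10)      # probability of 10 or more wrong
```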




23.4 We need to solve $P(Z > a) = 0.01$. In Table B.1 we find $P(Z > 2.33) =
0.0099 \approx 0.01$, so $z_{0.01} \approx 2.33$. For $z_{0.95}$ we need to solve $P(Z > a) = 0.95$,
and because this is in the left tail of the distribution, we use $z_{0.95} = -z_{0.05}$.
In the table we read $P(Z > 1.64) = 0.0505$ and $P(Z > 1.65) = 0.0495$, from
which we conclude $z_{0.05} \approx (1.64 + 1.65)/2 = 1.645$ and $z_{0.95} \approx -1.645$.


23.5 In Table B.1 we find $P(T_3 > 4.541) = 0.01$, so $t_{3,0.01} = 4.541$. For
$t_{35,0.9975}$, we need to use $t_{35,0.9975} = -t_{35,0.0025}$. In the table we find $t_{30,0.0025} =
3.030$ and $t_{40,0.0025} = 2.971$, and by interpolation $t_{35,0.0025} \approx (3.030 +
2.971)/2 = 3.0005$. Hence, $t_{35,0.9975} \approx -3.000$.


23.6 The order statistics are estimates for $c_{0.025}$ and $c_{0.975}$, respectively. So
the corresponding $\alpha$ is 0.05, and the 95% bootstrap confidence interval for $\mu$
is:

$$\left(656.88 - 1.713\,\frac{1037.3}{\sqrt{135}},\ 656.88 - (-2.443)\,\frac{1037.3}{\sqrt{135}}\right) = (504.0,\ 875.0).$$


23.6 Exercises 


23.1 A bottling machine is known to fill wine bottles with amounts that
follow an $N(\mu, \sigma^2)$ distribution, with $\sigma = 5$ (ml). In a sample of 16 bottles,
$\bar{x}_{16} = 743$ (ml) was found. Construct a 95% confidence interval for $\mu$.


23.2 You are given a dataset that may be considered a realization of a
normal random sample. The size of the dataset is 34, the average is 3.54, and
the sample standard deviation is 0.13. Construct a 98% confidence interval
for the unknown expectation $\mu$.


23.3 You have ordered 10 bags of cement, which are supposed to weigh 94 kg 
each. The average weight of the 10 bags is 93.5 kg. Assuming that the 10 
weights can be viewed as a realization of a random sample from a normal 
distribution with unknown parameters, construct a 95% confidence interval 
for the expected weight of a bag. The sample standard deviation of the 10 
weights is 0.75. 


23.4 A new type of car tire is launched by a tire manufacturer. The automobile
association performs a durability test on a random sample of 18 of
these tires. For each tire the durability is expressed as a percentage: a score
of 100 (%) means that the tire lasted exactly as long as the average standard
tire, an accepted comparison standard. From the multitude of factors that influence
the durability of individual tires the assumption is warranted that the
durability of an arbitrary tire follows an $N(\mu, \sigma^2)$ distribution. The parameters
$\mu$ and $\sigma^2$ characterize the tire type, and $\mu$ could be called the durability
index for this type of tire. The automobile association found for the tested
tires: $\bar{x}_{18} = 195.3$ and $s_{18} = 16.7$. Construct a 95% confidence interval for $\mu$.




23.5 During the 2002 Winter Olympic Games in Salt Lake City a newspaper
article mentioned the alleged advantage speed-skaters have in the 1500 m race 
if they start in the outer lane. In the men’s 1500 m, there were 24 races, but 
in race 13 (really!) someone fell and did not finish. The results in seconds of 
the remaining 23 races are listed in Table 23.5. You should know that who 
races against whom, in which race, and who starts in the outer lane are all 
determined by a fair lottery. 


Table 23.5. Speed-skating results in seconds, men’s 1500m (except race 13), 2002 
Winter Olympic Games. 


Race number   Inner lane   Outer lane   Difference
1 107.04 105.98 1.06 
2 109.24 108.20 1.04 
3 111.02 108.40 2.62 
4 108.02 108.58 —0.56 
5 107.83 105.51 2.32 
6 109.50 112.01 —2.51 
7 111.81 112.87 —1.06 
8 111.02 106.40 4.62 
9 106.04 104.57 1.47 
10 110.15 110.70 —0.55 
11 109.42 109.45 —0.03 
12 108.13 109.57 —1.44 
14 105.86 105.97 —0.11 
15 108.27 105.63 2.64 
16 107.63 105.41 2.22 
17 107.72 110.26 —2.54 
18 106.38 105.82 0.56 
19 107.78 106.29 1.49 
20 108.57 107.26 1.31 
21 106.99 103.95 3.04 
22 107.21 106.00 1.21 
23 105.34 105.26 0.08 
24 108.76 106.75 2.01 
Mean 108.25 107.43 0.82
St.dev. 1.70 2.42 1.78


a. As a consequence of the lottery and the fact that many different factors 
contribute to the actual time difference “inner lane minus outer lane” the 
assumption of a normal distribution for the difference is warranted. The 
numbers in the last column can be seen as realizations from an $N(\delta, \sigma^2)$
distribution, where $\delta$ is the expected outer lane advantage. Construct a
95% confidence interval for $\delta$. N.B. $n = 23$, not 24!


b. You decide to make a bootstrap confidence interval instead. Describe the 
appropriate bootstrap experiment. 

c. The bootstrap experiment was performed with one thousand repetitions. 
Part of the bootstrap outcomes are listed in the following table. From the 
ordered list of results, numbers 21 to 60 and 941 to 980 are given. Use 
these to construct a 95% bootstrap confidence interval for $\delta$.

21-25     -2.202  -2.164  -2.111  -2.109  -2.101
26-30     -2.099  -2.006  -1.985  -1.967  -1.929
31-35     -1.917  -1.898  -1.864  -1.830  -1.808
36-40     -1.800  -1.799  -1.774  -1.773  -1.756
41-45     -1.736  -1.732  -1.731  -1.717  -1.716
46-50     -1.699  -1.692  -1.691  -1.683  -1.666
51-55     -1.661  -1.644  -1.638  -1.637  -1.620
56-60     -1.611  -1.611  -1.601  -1.600  -1.593
941-945    1.648   1.667   1.669   1.689   1.696
946-950    1.708   1.722   1.726   1.735   1.814
951-955    1.816   1.825   1.856   1.862   1.864
956-960    1.875   1.877   1.897   1.905   1.917
961-965    1.923   1.948   1.961   1.987   2.001
966-970    2.015   2.015   2.017   2.018   2.034
971-975    2.035   2.037   2.039   2.053   2.060
976-980    2.088   2.092   2.101   2.129   2.148

23.6 A dataset $x_1, x_2, \ldots, x_n$ is given, modeled as realization of a sample
$X_1, X_2, \ldots, X_n$ from an $N(\mu, 1)$ distribution. Suppose there are sample
statistics $L_n = g(X_1, \ldots, X_n)$ and $U_n = h(X_1, \ldots, X_n)$ such that

$$P(L_n < \mu < U_n) = 0.95$$

for every value of $\mu$. Suppose that the corresponding 95% confidence interval
derived from the data is $(l_n, u_n) = (-2, 5)$.

a. Suppose $\theta = 3\mu + 7$. Let $\tilde{L}_n = 3L_n + 7$ and $\tilde{U}_n = 3U_n + 7$. Show that
$P(\tilde{L}_n < \theta < \tilde{U}_n) = 0.95$.

b. Write the 95% confidence interval for $\theta$ in terms of $l_n$ and $u_n$.

c. Suppose $\theta = 1 - \mu$. Again, find $\tilde{L}_n$ and $\tilde{U}_n$, as well as the confidence
interval for $\theta$.

d. Suppose $\theta = \mu^2$. Can you construct a confidence interval for $\theta$?




23.7 A 95% confidence interval for the parameter $\mu$ of a $Pois(\mu)$ distribution
is given: (2, 3). Let $X$ be a random variable with this distribution.
Construct a 95% confidence interval for $P(X = 0) = e^{-\mu}$.


23.8 Suppose that in Exercise 23.1 the content of the bottles has to be de- 
termined by weighing. It is known that the wine bottles involved weigh on 
average 250 grams, with a standard deviation of 15 grams, and the weights 
follow a normal distribution. For a sample of 16 bottles, an average weight of 
998 grams was found. You may assume that 1 ml of wine weighs 1 gram, and 
that the filling amount is independent of the bottle weight. Construct a 95%
confidence interval for the expected amount of wine per bottle, $\mu$.


23.9 Consider the alpha-particle counts discussed in Section 23.4; the data 
are given in Table 23.4. We want to bootstrap in order to make a bootstrap 
confidence interval for the expected number of particles in a 7.5-second inter- 
val. 


a. Describe in detail how you would perform the bootstrap simulation. 


b. The bootstrap experiment was performed with one thousand repetitions. 
Part of the (ordered) bootstrap t*’s are given in the following table. Con- 
struct the 95% bootstrap confidence interval for the expected number of 
particles in a 7.5-second interval. 


1-5        -2.996  -2.942  -2.831  -2.663  -2.570
6-10       -2.537  -2.505  -2.290  -2.273  -2.228
11-15      -2.193  -2.112  -2.092  -2.086  -2.045
16-20      -1.983  -1.980  -1.978  -1.950  -1.931
21-25      -1.920  -1.910  -1.893  -1.889  -1.888
26-30      -1.865  -1.864  -1.832  -1.817  -1.815
31-35      -1.755  -1.751  -1.749  -1.746  -1.744
36-40      -1.734  -1.723  -1.710  -1.708  -1.705
41-45      -1.703  -1.700  -1.696  -1.692  -1.691
46-50      -1.691  -1.675  -1.660  -1.656  -1.650

951-955     1.635   1.638   1.643   1.648   1.661
956-960     1.666   1.668   1.678   1.681   1.686
961-965     1.692   1.719   1.721   1.753   1.772
966-970     1.773   1.777   1.806   1.814   1.821
971-975     1.824   1.826   1.837   1.838   1.845
976-980     1.862   1.877   1.881   1.883   1.956
981-985     1.971   1.992   2.060   2.063   2.083
986-990     2.089   2.177   2.181   2.186   2.224
991-995     2.234   2.264   2.273   2.310   2.348
996-1000    2.483   2.556   2.870   2.890   3.546




c. Answer this without doing any calculations: if we made the 98% boot- 
strap confidence interval, would it be smaller or larger than the interval 
constructed in Section 23.4? 


23.10 In a report you encounter a 95% confidence interval (1.6, 7.8) for the
parameter $\mu$ of an $N(\mu, \sigma^2)$ distribution. The interval is based on 16 observations,
constructed according to the studentized mean procedure.


a. What is the mean of the (unknown) dataset? 
b. You prefer to have a 99% confidence interval for $\mu$. Construct it.


23.11 A 95% confidence interval for the unknown expectation of some 
distribution contains the number 0. 


a. We construct the corresponding 98% confidence interval, using the same 
data. Will it contain the number 0? 

b. The confidence interval in fact is a bootstrap confidence interval. We re- 
peat the bootstrap experiment (using the same data) and construct a new 
95% confidence interval based on the results. Will it contain the number 0? 

c. We collect new data, resulting in a dataset of the same size. With this data, 
we construct a 95% confidence interval for the unknown expectation. Will 
the interval contain 0? 


23.12 Let $Z_1, \ldots, Z_n$ be a random sample from an $N(0, 1)$ distribution. Define
$X_i = \mu + \sigma Z_i$ for $i = 1, \ldots, n$ and $\sigma > 0$. Let $\bar{Z}$, $\bar{X}$ denote the sample averages
and $S_Z$ and $S_X$ the sample standard deviations, of the $Z_i$ and $X_i$, respectively.

a. Show that $X_1, \ldots, X_n$ is a random sample from an $N(\mu, \sigma^2)$ distribution.
b. Express $\bar{X}$ and $S_X$ in terms of $\bar{Z}$, $S_Z$, $\mu$, and $\sigma$.
c. Verify that

$$\frac{\bar{X} - \mu}{S_X/\sqrt{n}} = \frac{\bar{Z}}{S_Z/\sqrt{n}},$$

and explain why this shows that the distribution of the studentized mean
does not depend on $\mu$ and $\sigma$.


24 


More on confidence intervals 


While in Chapter 23 we were solely concerned with confidence intervals for 
expectations, in this chapter we treat a variety of topics. First, we focus on 
confidence intervals for the parameter p of the binomial distribution. Then, 
based on an example, we briefly discuss a general method to construct confi- 
dence intervals. One-sided confidence intervals, or upper and lower confidence 
bounds, are discussed next. At the end of the chapter we investigate the ques- 
tion of how to determine the sample size when a confidence interval of a certain 
width is desired. 


24.1 The probability of success 


A common situation is that we observe a random variable X with a Bin(n, p) 
distribution and use X to estimate p. For example, if we want to estimate 
the proportion of voters that support candidate G in an election, we take a 
sample from the voter population and determine the proportion in the sample 
that supports G. If n individuals are selected at random from the population, 
where a proportion p supports candidate G, the number of supporters X in 
the sample is modeled by a Bin(n, p) distribution; we count the supporters of 
candidate G as “successes.” Usually, the sample proportion X/n is taken as 
an estimator for p. 

If we want to make a confidence interval for p, based on the number of suc- 
cesses X in the sample, we need to find statistics L and U (see the definition 
of confidence intervals on page 343) such that 


$$P(L < p < U) = 1 - \alpha,$$


where $L$ and $U$ are to be based on $X$ only. In general, this problem does
not have a solution. However, the method for large $n$ described next, sometimes
called “the Wilson method” (see [40]), yields confidence intervals with
confidence level approximately $100(1-\alpha)\%$. (How close the true confidence
level is to $100(1-\alpha)\%$ depends on the (unknown) $p$, though it is known that
for $p$ near 0 and 1 it is too low. For some details and an alternative for this
situation, see Remark 24.1.)

Recall the normal approximation to the binomial distribution, a consequence 
of the central limit theorem (see page 201 and Exercise 14.5): for large n, the 
distribution of X is approximately normal and 


$$\frac{X - np}{\sqrt{np(1-p)}}$$

is approximately standard normal. By dividing by $n$ in both the numerator
and the denominator, we see that this equals

$$\frac{\frac{X}{n} - p}{\sqrt{\frac{p(1-p)}{n}}}.$$

Therefore, for large $n$,

$$P\left(-z_{\alpha/2} < \frac{\frac{X}{n} - p}{\sqrt{\frac{p(1-p)}{n}}} < z_{\alpha/2}\right) \approx 1 - \alpha.$$

Note that the event

$$-z_{\alpha/2} < \frac{\frac{X}{n} - p}{\sqrt{\frac{p(1-p)}{n}}} < z_{\alpha/2}$$

is the same as

$$\left(\frac{\frac{X}{n} - p}{\sqrt{\frac{p(1-p)}{n}}}\right)^2 < (z_{\alpha/2})^2$$

or

$$\left(\frac{X}{n} - p\right)^2 - (z_{\alpha/2})^2\,\frac{p(1-p)}{n} < 0.$$

To derive expressions for $L$ and $U$ we can rewrite the inequality in this statement
to obtain the form $L < p < U$, but the resulting formulas are rather
awkward. To obtain the confidence interval, we instead substitute the data
values directly and then solve for $p$, which yields the desired result.


Suppose, in a sample of 125 voters, 78 support one candidate. What is the 95% 
confidence interval for the population proportion p supporting that candidate? 
The realization of $X$ is $x = 78$ and $n = 125$. We substitute this, together with
$z_{\alpha/2} = z_{0.025} = 1.96$, in the last inequality:

$$\left(\frac{78}{125} - p\right)^2 - \frac{(1.96)^2}{125}\,p(1-p) < 0,$$

or, working out squares and products and grouping terms:

$$1.0307\,p^2 - 1.2787\,p + 0.3894 < 0.$$

Fig. 24.1. The parabola $1.0307\,p^2 - 1.2787\,p + 0.3894$ and the resulting confidence
interval.


This quadratic form describes a parabola, which is depicted in Figure 24.1.
Also, for other values of $n$ and $\alpha$ there always results a quadratic inequality like
this, with a positive coefficient for $p^2$ and a similar picture. For the confidence
interval we need to find the values where the parabola intersects the horizontal
axis. The solutions we find are:

$$p_{1,2} = \frac{-(-1.2787) \pm \sqrt{(-1.2787)^2 - 4 \cdot 1.0307 \cdot 0.3894}}{2 \cdot 1.0307} = 0.6203 \pm 0.0835;$$

hence, $l = 0.54$ and $u = 0.70$, so the resulting confidence interval is (0.54, 0.70).
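The substitute-and-solve procedure can be written as a small function that builds the quadratic $a p^2 + b p + c < 0$ from the data and returns its two roots. This is a sketch of the calculation above; the function name and argument order are choices made here, not notation from the text. Expanding the inequality gives $a = 1 + z^2/n$, $b = -(2x/n + z^2/n)$, and $c = (x/n)^2$.

```python
import math


def wilson_interval(x, n, z):
    """Roots of (x/n - p)^2 - z^2 p (1 - p) / n = 0, written as a p^2 + b p + c = 0."""
    p_hat = x / n
    a = 1 + z**2 / n
    b = -(2 * p_hat + z**2 / n)
    c = p_hat**2
    root = math.sqrt(b**2 - 4 * a * c)
    return (-b - root) / (2 * a), (-b + root) / (2 * a)


# The voter example: 78 supporters out of 125, z_{0.025} = 1.96.
low, high = wilson_interval(78, 125, 1.96)
```

For the voter example the coefficients are the ones found above (1.0307, $-1.2787$, and 0.3894 after rounding), and the endpoints round to 0.54 and 0.70.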


QUICK EXERCISE 24.1 Suppose in another election we find 80 supporters in a
sample of 200. Suppose we use $\alpha = 0.0456$ for which $z_{\alpha/2} = 2$. Construct the
corresponding confidence interval for $p$.


Remark 24.1 (Coverage probabilities and an alternative method). 
Because of the discrete nature of the binomial distribution, the probabil- 
ity that the confidence interval covers the true parameter value depends 
on $p$. As a function of $p$ it typically oscillates in a sawtooth-like manner
around $1 - \alpha$, being too high for some values and too low for others. This
is something that cannot be escaped from; the phenomenon is present in
every method. In an average sense, the method treated in the text yields
coverage probabilities close to $1 - \alpha$, though for arbitrarily high values of $n$
it is possible to find $p$’s for which the actual coverage is several percentage
points too low. The low coverage occurs for $p$’s near 0 and 1.

An alternative is the method proposed by Agresti and Coull, which overall
is more conservative than the Wilson method (in fact, the Agresti-Coull
interval contains the Wilson interval as a proper subset). Especially for $p$
near 0 or 1 this method yields conservative confidence intervals. Define

$$\tilde{X} = X + \frac{(z_{\alpha/2})^2}{2} \quad\text{and}\quad \tilde{n} = n + (z_{\alpha/2})^2,$$

and $\tilde{p} = \tilde{X}/\tilde{n}$. The approximate $100(1-\alpha)\%$ confidence interval is then
given by

$$\left(\tilde{p} - z_{\alpha/2}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{\tilde{n}}},\ \tilde{p} + z_{\alpha/2}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{\tilde{n}}}\right).$$

For a clear survey paper on confidence intervals for p we recommend Brown 
et al. [4]. 
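As an illustration, the Agresti-Coull interval of Remark 24.1 takes only a few lines to compute. The function name is hypothetical, and the voter numbers (78 out of 125 with $z_{0.025} = 1.96$) are reused from Section 24.1 purely as an example.

```python
import math


def agresti_coull_interval(x, n, z):
    """Approximate confidence interval for p following Agresti and Coull."""
    x_tilde = x + z**2 / 2
    n_tilde = n + z**2
    p_tilde = x_tilde / n_tilde
    half = z * math.sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - half, p_tilde + half


# Same data as the voter example: 78 supporters out of 125, z_{0.025} = 1.96.
low, high = agresti_coull_interval(78, 125, 1.96)
```

For these data the center $\tilde{p}$ is about 0.6203, the same as the midpoint found with the Wilson method; the two intervals differ only in their half-widths.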


24.2 Is there a general method? 


We have now seen a number of examples of confidence intervals, and while it 
should be clear to you that in each of these cases the resulting intervals are 
valid confidence intervals, you may wonder how we go about finding confidence 
intervals in new situations. One could ask: is there a general method? We first 
consider an example. 


A confidence interval for the minimum lifetime 


Suppose we have a random sample $X_1, \ldots, X_n$ from a shifted exponential
distribution, that is, $X_i = \delta + Y_i$, where $Y_1, \ldots, Y_n$ are a random sample from
an $Exp(1)$ distribution. This type of random variable is sometimes used to
model lifetimes; a minimum lifetime is guaranteed, but otherwise the lifetime
has an exponential distribution. The unknown parameter $\delta$ represents the
minimum lifetime, and the probability density of the $X_i$ is positive only for
values greater than $\delta$.

To derive information about $\delta$ it is natural to use the smallest observed value
$T = \min\{X_1, \ldots, X_n\}$. This is also the maximum likelihood estimator for $\delta$;
see Exercise 21.6. Writing

$$T = \min\{\delta + Y_1, \ldots, \delta + Y_n\} = \delta + \min\{Y_1, \ldots, Y_n\}$$

and observing that $M = \min\{Y_1, \ldots, Y_n\}$ has an $Exp(n)$ distribution (see
Exercise 8.18), we find for the distribution function of $T$: $F_T(a) = 0$ for $a < \delta$
and

$$F_T(a) = P(T \leq a) = P(\delta + M \leq a) = P(M \leq a - \delta) = 1 - e^{-n(a-\delta)} \quad\text{for } a \geq \delta. \qquad (24.1)$$


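Next, we solve

$$P(c_l < T < c_u) = 1 - \alpha$$

by requiring

$$P(T < c_l) = P(T > c_u) = \tfrac{1}{2}\alpha.$$

Using (24.1) we find the following equations:

$$1 - e^{-n(c_l - \delta)} = \tfrac{1}{2}\alpha \quad\text{and}\quad e^{-n(c_u - \delta)} = \tfrac{1}{2}\alpha,$$

whose solutions are

$$c_l = \delta - \frac{1}{n}\ln\left(1 - \tfrac{1}{2}\alpha\right) \quad\text{and}\quad c_u = \delta - \frac{1}{n}\ln\left(\tfrac{1}{2}\alpha\right).$$

Both $c_l$ and $c_u$ are values larger than $\delta$, because the logarithms are negative.
We have found that, whatever the value of $\delta$:

$$P\left(\delta - \frac{1}{n}\ln\left(1 - \tfrac{1}{2}\alpha\right) < T < \delta - \frac{1}{n}\ln\left(\tfrac{1}{2}\alpha\right)\right) = 1 - \alpha.$$

By rearranging the inequalities, we see this is equivalent to

$$P\left(T + \frac{1}{n}\ln\left(\tfrac{1}{2}\alpha\right) < \delta < T + \frac{1}{n}\ln\left(1 - \tfrac{1}{2}\alpha\right)\right) = 1 - \alpha,$$

and therefore a $100(1-\alpha)\%$ confidence interval for $\delta$ is given by

$$\left(t + \frac{1}{n}\ln\left(\tfrac{1}{2}\alpha\right),\ t + \frac{1}{n}\ln\left(1 - \tfrac{1}{2}\alpha\right)\right).$$

For $\alpha = 0.05$ this becomes:

$$\left(t - \frac{3.69}{n},\ t - \frac{0.0253}{n}\right). \qquad (24.2)$$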
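The interval for the minimum lifetime, and the claim that it covers $\delta$ with probability $1 - \alpha$ whatever the value of $\delta$, can be checked numerically. The sketch below uses hypothetical function names, and the simulation parameters (the value of $\delta$, the number of repetitions, the seed) are arbitrary choices. It simulates $T$ directly as $\delta$ plus the minimum of $n$ Exp(1) draws.

```python
import math
import random


def minimum_lifetime_interval(t, n, alpha):
    """Confidence interval for delta from the observed minimum t of n values."""
    lower = t + math.log(alpha / 2) / n
    upper = t + math.log(1 - alpha / 2) / n
    return lower, upper


def simulated_coverage(delta, n, alpha, reps=5000, seed=0):
    """Fraction of simulated intervals that contain the true delta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Minimum of n shifted Exp(1) lifetimes.
        t = delta + min(rng.expovariate(1.0) for _ in range(n))
        low, high = minimum_lifetime_interval(t, n, alpha)
        hits += low < delta < high
    return hits / reps


coverage = simulated_coverage(delta=10.0, n=15, alpha=0.05)
```

With $\alpha = 0.05$ the two logarithms are about $-3.69$ and $-0.0253$, in agreement with (24.2).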


QUICK EXERCISE 24.2 Suppose you have a dataset of size 15 from a shifted
Exp(1) distribution, whose minimum value is 23.5. What is the 99% confidence
interval for $\delta$?


Looking back at the example, we see that the confidence interval could be
constructed because we know that $T - \delta = M$ has an exponential distribution.
There are many more examples of this type: some function $g(T, \theta)$ of a sample
statistic $T$ and the unknown parameter $\theta$ has a known distribution. However,
this still does not cover all the ways to construct confidence intervals (see also
the following remark).


Remark 24.2 (About a general method). Suppose $X_1, \ldots, X_n$ is a
random sample from some distribution depending on some unknown parameter
$\theta$ and let $T$ be a sample statistic. One possible choice is to select
a $T$ that is an estimator for $\theta$, but this is not necessary. In each case, the
distribution of $T$ depends on $\theta$, just as that of $X_1, \ldots, X_n$ does. In some
cases it might be possible to find functions $g(\theta)$ and $h(\theta)$ such that

$$P(g(\theta) < T < h(\theta)) = 1 - \alpha \quad\text{for every value of } \theta. \qquad (24.3)$$

If this is so, then confidence statements about $\theta$ can be made. In more
special cases, for example if $g$ and $h$ are strictly increasing, the inequalities
$g(\theta) < T < h(\theta)$ can be rewritten as

$$h^{-1}(T) < \theta < g^{-1}(T),$$

and then (24.3) is equivalent to

$$P(h^{-1}(T) < \theta < g^{-1}(T)) = 1 - \alpha \quad\text{for every value of } \theta.$$

Checking with the confidence interval definition, we see that the last statement
implies that $(h^{-1}(t), g^{-1}(t))$ is a $100(1-\alpha)\%$ confidence interval for $\theta$.


24.3 One-sided confidence intervals 


Suppose you are in charge of a power plant that generates and sells electricity, 
and you are about to buy a shipment of coal, say a shipment of the Daw Mill 
coal identified as 258GB41 earlier. You plan to buy the shipment if you are 
confident that the gross calorific content exceeds 31.00 MJ/kg. At the end of 
Section 23.2 we obtained for the gross calorific content the 95% confidence 
interval (30.946, 31.067): based on the data we are 95% confident that the 
gross calorific content is higher than 30.946 and lower than 31.067. 


In the present situation, however, we are only interested in the lower bound: 
we would prefer a confidence statement of the type “we are 95% confident 
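that the gross calorific content exceeds 31.00.” Modifying equation (23.4) we
find

$$P\left(\frac{\bar{X}_n - \mu}{S_n/\sqrt{n}} < t_{n-1,\alpha}\right) = 1 - \alpha,$$

which is equivalent to

$$P\left(\bar{X}_n - t_{n-1,\alpha}\,\frac{S_n}{\sqrt{n}} < \mu\right) = 1 - \alpha.$$

We conclude that

$$\left(\bar{x}_n - t_{n-1,\alpha}\,\frac{s_n}{\sqrt{n}},\ \infty\right)$$

is a $100(1-\alpha)\%$ one-sided confidence interval for $\mu$. For the Daw Mill coal,
using $\alpha = 0.05$, with $t_{21,0.05} = 1.721$ this results in:

$$\left(31.012 - 1.721\,\frac{0.1294}{\sqrt{22}},\ \infty\right) = (30.964,\ \infty).$$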
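The lower confidence bound is a one-line computation once the critical value is known. In the sketch below the critical value is passed in as an argument, because the Python standard library has no t-distribution quantile function; the value 1.721 is the table value quoted above, and the function name is illustrative.

```python
import math


def lower_confidence_bound(mean, sd, n, t_crit):
    """Lower bound of the one-sided interval (mean - t_crit * sd / sqrt(n), infinity)."""
    return mean - t_crit * sd / math.sqrt(n)


# Daw Mill coal: sample mean 31.012, s_n = 0.1294, n = 22, t_{21,0.05} = 1.721.
bound = lower_confidence_bound(31.012, 0.1294, 22, 1.721)
```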




We see that because “all uncertainty may be put on one side,” the lower 
bound in the one-sided interval is higher than that in the two-sided one, 
though still below 31.00. Other situations may require a confidence upper 
bound. For example, if the calorific value is below a certain number you can
try to negotiate a lower price.


The definition of confidence intervals (page 343) can be extended to include
one-sided confidence intervals as well. If we have a sample statistic $L_n$ such
that

$$P(L_n < \theta) = \gamma$$

for every value of the parameter of interest $\theta$, then

$$(l_n, \infty)$$

is called a $100\gamma\%$ one-sided confidence interval for $\theta$. The number $l_n$ is
sometimes called a $100\gamma\%$ lower confidence bound for $\theta$. Similarly, $U_n$ with
$P(\theta < U_n) = \gamma$ for every value of $\theta$, yields the one-sided confidence interval
$(-\infty, u_n)$, and $u_n$ is called a $100\gamma\%$ upper confidence bound.


QUICK EXERCISE 24.3 Determine the 99% upper confidence bound for the 
gross calorific value of the Daw Mill coal. 


24.4 Determining the sample size 


The narrower the confidence interval the better (why?). As a general prin- 
ciple, we know that more accurate statements can be made if we have more 
measurements. Sometimes, an accuracy requirement is set, even before data 
are collected, and the corresponding sample size is to be determined. We pro- 
vide an example of how to do this and note that this generally can be done, 
but the actual computation varies with the type of confidence interval. 
Consider the question of the calorific content of coal once more. We have a 
shipment of coal to test and we want to obtain a 95% confidence interval, 
but it should not be wider than 0.05 MJ/kg, i.e., the lower and upper bound 
should not differ more than 0.05. How many measurements do we need? 

We answer this question for the case when ISO method 1928 is used, whence 
we may assume that measurements are normally distributed with standard 
deviation $\sigma = 0.1$. When the desired confidence level is $1 - \alpha$, the width of
the confidence interval will be

$$2 \cdot z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}.$$

Requiring that this is at most $w$ means finding the smallest $n$ that satisfies

$$2 z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \leq w$$

or

$$n \geq \left(\frac{2 z_{\alpha/2}\,\sigma}{w}\right)^2.$$

For the example: $w = 0.05$, $\sigma = 0.1$, and $z_{0.025} = 1.96$; so

$$n \geq \left(\frac{2 \cdot 1.96 \cdot 0.1}{0.05}\right)^2 = 61.5,$$

that is, we should perform at least 62 measurements.
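The sample size rule translates directly into code. The sketch below assumes normally distributed measurements with known $\sigma$, as in the example; the function name is hypothetical, and the critical value comes from `statistics.NormalDist` instead of Table B.1.

```python
import math
from statistics import NormalDist


def required_sample_size(width, sigma, level):
    """Smallest n for which the interval of the given level is at most `width` wide."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)    # z_{alpha/2}
    return math.ceil((2 * z * sigma / width) ** 2)


n_needed = required_sample_size(width=0.05, sigma=0.1, level=0.95)
```

Rounding up with `math.ceil` reflects that $n$ must be an integer satisfying the inequality.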


In case o is unknown, we somehow have to estimate it, and then the method 
can only give an indication of the required sample size. The standard deviation 
as we (afterwards) estimate it from the data may turn out to be quite different, 
and the obtained confidence interval may be smaller or larger than intended. 


QUICK EXERCISE 24.4 What is the required sample size if we want the 99% 
confidence interval to be 0.05 MJ/kg wide? 


24.5 Solutions to the quick exercises 
24.1 We need to solve 


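$$\left(\frac{80}{200} - p\right)^2 - \frac{(2)^2}{200}\,p(1-p) < 0, \quad\text{or}\quad 1.02\,p^2 - 0.82\,p + 0.16 < 0.$$

The solutions are:

$$p_{1,2} = \frac{-(-0.82) \pm \sqrt{(-0.82)^2 - 4 \cdot 1.02 \cdot 0.16}}{2 \cdot 1.02} = 0.4020 \pm 0.0686,$$

so the confidence interval is (0.33, 0.47).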


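24.2 We should substitute $n = 15$, $t = 23.5$, and $\alpha = 0.01$ into:

$$\left(t + \frac{1}{n}\ln\left(\tfrac{1}{2}\alpha\right),\ t + \frac{1}{n}\ln\left(1 - \tfrac{1}{2}\alpha\right)\right),$$

which yields

$$\left(23.5 - \frac{5.30}{15},\ 23.5 - \frac{0.0050}{15}\right) = (23.1467,\ 23.4997).$$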


24.3 The upper confidence bound is given by 


$$u_n = \bar{x}_n + t_{21,0.01}\,\frac{s_n}{\sqrt{22}},$$

where $\bar{x}_n = 31.012$, $t_{21,0.01} = 2.518$, and $s_n = 0.1294$. Substitution yields
$u_n = 31.081$.




24.4 The confidence level changes to 99%, so we use $z_{0.005} = 2.576$ instead
of 1.96 in the computation:

$$n \geq \left(\frac{2 \cdot 2.576 \cdot 0.1}{0.05}\right)^2 = 106.2,$$


so we need at least 107 measurements. 


24.6 Exercises 


24.1 Of a series of 100 (independent and identical) chemical experiments,
70 were concluded successfully. Construct a 90% confidence interval for the
success probability of this type of experiment.


24.2 In January 2002 the Euro was introduced and soon after stories started 
to circulate that some of the Euro coins would not be fair coins, because the 
“national side” of some coins would be too heavy or too light (see, for example, 
the New Scientist of January 4, 2002, but also national newspapers of that 
date). 


a. A French 1 Euro coin was tossed six times, resulting in 1 heads and 5 tails. 
Is it reasonable to use the Wilson method, introduced in Section 24.1, to 
construct a confidence interval for p? 


b. A Belgian 1 Euro coin was tossed 250 times: 140 heads and 110 tails. 
Construct a 95% confidence interval for the probability of getting heads 
with this coin. 


24.3 In Exercise 23.1, what sample size is needed if we want a 99% confidence
interval for $\mu$ at most 1 ml wide?


24.4 Recall Exercise 23.3 and the 10 bags of cement that should each weigh
94 kg. The average weight was 93.5 kg, with sample standard deviation 0.75. 


a. Based on these data, how many bags would you need to sample to make 
a 90% confidence interval that is 0.1 kg wide? 


b. Suppose you actually do measure the required number of bags and con- 
struct a new confidence interval. Is it guaranteed to be at most 0.1 kg 
wide? 


24.5 Suppose we want to make a 95% confidence interval for the probability 
of getting heads with a Dutch 1 Euro coin, and it should be at most 0.01 
wide. To determine the required sample size, we note that the probability of 
getting heads is about 0.5. Furthermore, if X has a Bin(n,p) distribution, 
with n large and $p \approx 0.5$, then

\[ \frac{X - np}{\sqrt{n/4}} \]

is approximately standard normal.


a. Use this statement to derive that the width of the 95% confidence interval
for p is approximately

\[ \frac{z_{0.025}}{\sqrt{n}}. \]

Use this width to determine how large n should be.


b. The coin is thrown the number of times just computed, resulting in 19477 
times heads. Construct the 95% confidence interval and check whether the 
required accuracy is attained. 


24.6 Environmentalists have taken 16 samples from the wastewater of a
chemical plant and measured the concentration of a certain carcinogenic
substance. They found $\bar{x}_{16} = 2.24$ (ppm) and $s_{16}^2 = 1.12$, and want to use these
data in a lawsuit against the plant. It may be assumed that the data are a 
realization of a normal random sample. 


a. Construct the 97.5% one-sided confidence interval that the environmen- 
talists made to convince the judge that the concentration exceeds legal 
limits. 

b. The plant management uses the same data to construct a 97.5% one- 
sided confidence interval to show that concentrations are not too high. 
Construct this interval as well. 


24.7 Consider once more the Rutherford-Geiger data as given in Section 23.4.
Knowing that the number of $\alpha$-particle emissions during an interval has a
Poisson distribution, we may see the data as observations from a $Pois(\mu)$
distribution. The central limit theorem tells us that the average $\bar{X}_n$ of a large
number of independent $Pois(\mu)$ random variables approximately has a normal
distribution and hence that

\[ \frac{\bar{X}_n - \mu}{\sqrt{\mu}/\sqrt{n}} \]

has a distribution that is approximately $N(0, 1)$.


a. Show that the large sample 95% confidence interval contains those values
of $\mu$ for which

\[ (\bar{x}_n - \mu)^2 \le (1.96)^2 \frac{\mu}{n}. \]

b. Use the result from a to construct the large sample 95% confidence interval 
based on the Rutherford-Geiger data. 


c. Compare the result with that of Exercise 23.9 b. Is this surprising? 


24.8 Recall Exercise 23.5 about the 1500 m speed-skating results in the 2002
Winter Olympic Games. If there were no outer lane advantage, the number
out of the 23 completed races won by skaters starting in the outer lane would
have a $Bin(23, p)$ distribution with $p = 1/2$, because of the lane assignment
by lottery.


a. Of the 23 races, 15 were won by the skater starting in the outer lane. Use 
this information to construct a 95% confidence interval for p by means 
of the Wilson method. If you think that n = 23 is probably too small to 
use a method based on the central limit theorem, we agree. We should be 
careful with conclusions we draw from this confidence interval. 


b. The question posed earlier “Is there an outer lane advantage?” implies that 
a one-sided confidence interval is more suitable. Construct the appropriate 
95% one-sided confidence interval for p by first constructing a 90% two- 
sided confidence interval. 


24.9 Suppose we have a dataset $x_1, \ldots, x_{12}$ that may be modeled as the
realization of a random sample $X_1, \ldots, X_{12}$ from a $U(0, \theta)$ distribution, with
$\theta$ unknown. Let $M = \max\{X_1, \ldots, X_{12}\}$.


a. Show that for $0 \le t \le 1$

\[ P\left( \frac{M}{\theta} \le t \right) = t^{12}. \]


b. Use $\alpha = 0.1$ and solve


c. Suppose the realization of M is m = 3. Construct the 90% confidence
interval for $\theta$.


d. Derive the general expression for a confidence interval of level $1 - \alpha$ based
on a sample of size n.


24.10 Suppose we have a dataset $x_1, \ldots, x_n$ that may be modeled as the
realization of a random sample $X_1, \ldots, X_n$ from an $Exp(\lambda)$ distribution, where
$\lambda$ is unknown. Let $S_n = X_1 + \cdots + X_n$.


a. Check that $\lambda S_n$ has a $Gam(n, 1)$ distribution.


b. The following quantiles of the $Gam(20, 1)$ distribution are given: $q_{0.05} = 13.25$
and $q_{0.95} = 27.88$. Use these to construct a 90% confidence interval
for $\lambda$ when n = 20.


25 


Testing hypotheses: essentials 


The statistical methods that we have discussed until now have been devel- 
oped to infer knowledge about certain features of the model distribution that 
represent our quantities of interest. These inferences often take the form of 
numerical estimates, as either single numbers or confidence intervals. How- 
ever, sometimes the conclusion to be drawn is not expressed numerically, but 
is concerned with choosing between two conflicting theories, or hypotheses. 
For instance, one has to assess whether the lifetime of a certain type of ball 
bearing deviates or does not deviate from the lifetime guaranteed by the man- 
ufacturer of the bearings; an engineer wants to know whether dry drilling is 
faster or the same as wet drilling; a gynecologist wants to find out whether 
smoking affects or does not affect the probability of getting pregnant; the Al- 
lied Forces want to know whether the German war production is equal to or 
smaller than what Allied intelligence agencies reported. The process of formu- 
lating the possible conclusions one can draw from an experiment and choosing 
between two alternatives is known as hypothesis testing. In this chapter we 
start to explore this statistical methodology. 


25.1 Null hypothesis and test statistic 


We will introduce the basic concepts of hypothesis testing with an exam- 
ple. Let us return to the analysis of German war equipment. During World 
War II the Allied Forces received reports by the Allied intelligence agencies 
on German war production. The numbers of produced tires, tanks, and other 
equipment, as claimed in these reports, were a lot higher than indicated by 
the observed serial numbers. The objective was to decide whether the actual 
produced quantities were smaller than the ones reported. 

For simplicity suppose that we have observed tanks with (recoded) serial numbers

61 19 56 24 16.




Furthermore, suppose that the Allied intelligence agencies report a production
of 350 tanks.¹ This is a lot more than we would surmise from the observed
data. We want to choose between the proposition that the total number of
tanks is 350 and the proposition that the total number is smaller than 350.
The two competing propositions are called null hypothesis, denoted by $H_0$, and
alternative hypothesis, denoted by $H_1$. The way we go about choosing between
$H_0$ and $H_1$ is conceptually similar to the way a jury deliberates in a court
trial. The null hypothesis corresponds to the position of the defendant: just
as he is presumed to be innocent until proven guilty, so is the null hypothesis
presumed to be true until the data provide convincing evidence against it.
The alternative hypothesis corresponds to the charges brought against the
defendant.


To decide whether $H_0$ is false we use a statistical model. As argued in
Chapter 20 the (recoded) serial numbers are modeled as a realization of random
variables $X_1, X_2, \ldots, X_5$ representing five draws without replacement from the
numbers $1, 2, \ldots, N$. The parameter N represents the total number of tanks.
The two hypotheses in question are

\[ \begin{aligned} H_0 &: N = 350 \\ H_1 &: N < 350. \end{aligned} \]


If we reject the null hypothesis we will accept $H_1$; we speak of rejecting $H_0$
in favor of $H_1$. Usually, the alternative hypothesis represents the theory or
belief that we would like to accept if we do reject $H_0$. This means that we
must carefully choose $H_1$ in relation to our interests in the problem at hand.
In our example we are particularly interested in whether the number of tanks
is less than 350; so we test the null hypothesis against $H_1: N < 350$. If we
were interested in whether the number of tanks differs from 350, or is
greater than 350, we would test against $H_1: N \ne 350$ or $H_1: N > 350$.


QUICK EXERCISE 25.1 In the drilling example from Sections 15.5 and 16.4 the
data on drill times for dry drilling are modeled as a realization of a random
sample from a distribution with expectation $\mu_1$, and similarly the data for wet
drilling correspond to a distribution with expectation $\mu_2$. We want to know
whether dry drilling is faster than wet drilling. To this end we test the null
hypothesis $H_0: \mu_1 = \mu_2$ (the drill time is the same for both methods). What
would you choose for $H_1$?


The next step is to select a criterion based on $X_1, X_2, \ldots, X_5$ that provides an
indication about whether $H_0$ is false. Such a criterion involves a test statistic.


¹ This may seem ridiculous. However, when after the war official German production
statistics became available, the average monthly production of tanks during
the period 1940-1943 was 342. During the war this number was estimated at 327,
whereas Allied intelligence reported 1550! (see [27]).




TEST STATISTIC. Suppose the dataset is modeled as the realization
of random variables $X_1, X_2, \ldots, X_n$. A test statistic is any sample
statistic $T = h(X_1, X_2, \ldots, X_n)$, whose numerical value is used to
decide whether we reject $H_0$.


In the tank example we use the test statistic

\[ T = \max\{X_1, X_2, \ldots, X_5\}. \]


Having chosen a test statistic T, we investigate what sort of values T can
attain. These values can be viewed on a credibility scale for $H_0$, and we must
determine which of these values provide evidence in favor of $H_0$, and which
provide evidence in favor of $H_1$. First of all note that if we find a value of
T larger than 350, we immediately know that $H_0$ as well as $H_1$ is false. If
this happens, we actually should be considering another testing problem, but
for the current problem of testing $H_0: N = 350$ against $H_1: N < 350$ such
values are irrelevant. Hence the possible values of T that are of interest to us
are the integers from 5 to 350.


If $H_0$ is true, then what is a typical value for T and what is not? Remember
from Section 20.1 that, because n = 5, the expectation of T is $E[T] = \frac{5}{6}(N + 1)$.
This means that the distribution of T is centered around $\frac{5}{6}(N + 1)$. Hence, if
$H_0$ is true, then typical values of T are in the neighborhood of $\frac{5}{6} \cdot 351 = 292.5$.
Values of T that deviate a lot from 292.5 are evidence against $H_0$. Values that
are much greater than 292.5 are evidence against $H_0$ but provide even stronger
evidence against $H_1$. For such values we will not reject $H_0$ in favor of $H_1$. Also
values a little smaller than 292.5 are grounds not to reject $H_0$, because we are
committed to giving $H_0$ the benefit of the doubt. On the other hand, values
of T very close to 5 should be considered as strong evidence against the null
hypothesis and are in favor of $H_1$, hence they lead to a decision to reject $H_0$.
This is summarized in Figure 25.1.
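
The center 292.5 can also be checked directly from the distribution of the maximum. The following Python sketch is an illustration under the model of this section; the function name `expected_maximum` is hypothetical. It uses the fact that, among all equally likely subsets of n numbers from 1, ..., N, the maximum equals t exactly when t is drawn and the other n - 1 numbers are smaller than t.

```python
from fractions import Fraction
from math import comb

def expected_maximum(N, n):
    """Exact E[T] for T = max of n draws without replacement from 1..N."""
    total = comb(N, n)  # number of equally likely subsets of size n
    # T = t means: t is drawn and the other n - 1 numbers lie below t.
    return sum(Fraction(t * comb(t - 1, n - 1), total) for t in range(n, N + 1))

expected = expected_maximum(350, 5)
print(float(expected))
```

Exact fractions are used so that the result can be compared with $\frac{5}{6} \cdot 351$ without rounding error.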


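[Figure: axis of possible values of T with 5, 292.5, and 350 marked. From left to right the regions are labeled "values in favor of $H_1$", "values in favor of $H_0$", and "values against both $H_0$ and $H_1$".]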


Fig. 25.1. Values of the test statistic T. 


QUICK EXERCISE 25.2 Another possible test statistic would be $\bar{X}_5$. If we use
its values as a credibility scale for $H_0$, then what are the possible values of
$\bar{X}_5$, which values of $\bar{X}_5$ are in favor of $H_1: N < 350$, and which values are in
favor of $H_0: N = 350$?


For the data we find

\[ t = \max\{61, 19, 56, 24, 16\} = 61 \]

as the realization of the test statistic. How do we use this to decide on $H_0$?


25.2 Tail probabilities 


As we have just seen, if $H_0$ is true, then typical values of T are in the neighborhood
of $\frac{5}{6} \cdot 351 = 292.5$. In view of Figure 25.1, the more a value of T is to the
left, the stronger evidence it provides in favor of $H_1$. The value 61 is in the left
region of Figure 25.1. Can we now reject $H_0$ and conclude that N is smaller
than 350, or can the fact that we observe 61 as maximum be attributed to
chance? In courtroom terminology: can we reach the conclusion that the null
hypothesis is false beyond reasonable doubt? One way to investigate this is to
examine how likely it is that one would observe a value of T that provides
even stronger evidence against $H_0$ than 61, in the situation that N = 350. If
this is very unlikely, then 61 already bears strong evidence against $H_0$.


Values of T that provide stronger evidence against $H_0$ than 61 are to the
left of 61. Therefore we compute $P(T \le 61)$. In the situation that N = 350,
the test statistic T is the maximum of 5 numbers drawn without replacement
from $1, 2, \ldots, 350$. We find that

\[ P(T \le 61) = P(\max\{X_1, X_2, \ldots, X_5\} \le 61) = \frac{61}{350} \cdot \frac{60}{349} \cdots \frac{57}{346} = 0.00014. \]


This probability is so small that we view the value 61 as strong evidence
against the null hypothesis. Indeed, if the null hypothesis were true, then
values of T that would provide the same or even stronger evidence against $H_0$
than 61 are very unlikely to occur, i.e., they occur with probability 0.00014!
In other words, the observed value 61 is exceptionally small in case $H_0$ is true.
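
The tail probability can be verified numerically. The sketch below is an illustration, with the hypothetical function name `left_tail_p_value`; it assumes the model of this section, in which every subset of n serial numbers from 1, ..., N is equally likely. The maximum is at most t exactly when all n numbers come from 1, ..., t, which gives a ratio of two binomial coefficients equal to the product of fractions above.

```python
from math import comb

def left_tail_p_value(t, N, n=5):
    """P(T <= t) when T is the maximum of n draws without replacement from 1..N."""
    if t < n:
        return 0.0  # the maximum of n distinct positive integers is at least n
    return comb(min(t, N), n) / comb(N, n)

p_value = left_tail_p_value(61, 350)
print(f"{p_value:.5f}")
```

The same function applies to any other reported production N and observed maximum t.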


At this point we can do two things: either we believe that $H_0$ is true and
that something very unlikely has happened, or we believe that events with
such a small probability do not happen in practice, so that $T \le 61$ could
only have occurred because $H_0$ is false. We choose to believe that things
happening with probability 0.00014 are so exceptional that we reject the null
hypothesis $H_0: N = 350$ in favor of the alternative hypothesis $H_1: N < 350$.
In courtroom terminology: we say that a value of T smaller than or equal to
61 implies that the null hypothesis is false beyond reasonable doubt.


P-values 


In our example, the more a value of T is to the left, the stronger evidence
it provides against $H_0$. For this reason we computed the left tail probability
$P(T \le 61)$. In other situations, the direction in which values of T provide
stronger evidence against $H_0$ may be to the right of the observed value t,
in which case one would compute a right tail probability $P(T \ge t)$. In both
cases the tail probability expresses how likely it is to obtain a value of the
test statistic T at least as extreme as the value t observed for the data. Such
a probability is called a p-value. In a way, the size of the p-value reflects how
much evidence the observed value t provides against $H_0$. The smaller the
p-value, the stronger evidence the observed value t bears against $H_0$.

The phrase “at least as extreme as the observed value t” refers to a particular
direction, namely the direction in which values of T provide stronger evidence
against $H_0$ and in favor of $H_1$. In our example, this was to the left of 61, and
the p-value corresponding to 61 was $P(T \le 61) = 0.00014$. In this case it is
clear what is meant by “at least as extreme as t” and which tail probability
corresponds to the p-value. However, in some testing problems one can deviate
from $H_0$ in both directions. In such cases it may not be clear what values of
T are at least as extreme as the observed value, and it may be unclear how
the p-value should be computed. One approach to a solution in this case is
to simply compute the one-tailed p-value that corresponds to the direction in
which t deviates from $H_0$.


QUICK EXERCISE 25.3 Suppose that the Allied intelligence agencies had reported
a production of 80 tanks, so that we would test $H_0: N = 80$ against
$H_1: N < 80$. Compute the p-value corresponding to 61. Would you conclude
$H_0$ is false beyond reasonable doubt?


25.3 Type I and type II errors 


Suppose that the maximum is 200 instead of 61. This is also to the left of
the expected value 292.5 of T. Is it far enough to the left to reject the null
hypothesis? In this case the p-value is equal to

\[ P(T \le 200) = P(\max\{X_1, X_2, \ldots, X_5\} \le 200) = \frac{200}{350} \cdot \frac{199}{349} \cdots \frac{196}{346} = 0.0596. \]

This means that if the total number of produced tanks is 350, then in 5.96%
of all cases we would observe a value of T that is at least as extreme as the
value 200. Before we decide whether 0.0596 is small enough to reject the null
hypothesis let us explore in more detail what the preceding probability stands
for.

It is important to distinguish between (1) the true state of nature: $H_0$ is true
or $H_1$ is true and (2) our decision: we reject or do not reject $H_0$ on the basis
of the data. In our example the possibilities for the true state of nature are:


• $H_0$ is true, i.e., there are 350 tanks produced.
• $H_1$ is true, i.e., the number of tanks produced is less than 350.




We do not know in which situation we are. There are two possible decisions: 


• We reject $H_0$ in favor of $H_1$.
• We do not reject $H_0$.


This leads to four possible situations, which are summarized in Figure 25.2. 


                                         True state of nature
                                         $H_0$ is true       $H_1$ is true

Our decision on the   Reject $H_0$       Type I error        Correct decision
basis of the data     Not reject $H_0$   Correct decision    Type II error


Fig. 25.2. Four situations when deciding about Ho. 


There are two situations in which the decision made on the basis of the data is
wrong. The null hypothesis $H_0$ may be true, whereas the data lead to rejection
of $H_0$. On the other hand, the alternative hypothesis $H_1$ may be true, whereas
we do not reject $H_0$ on the basis of the data. These wrong decisions are called
type I and type II errors.


TYPE I AND II ERRORS. A type I error occurs if we falsely reject
$H_0$. A type II error occurs if we falsely do not reject $H_0$.


In courtroom terminology, a type I error corresponds to convicting an innocent 
defendant, whereas a type II error corresponds to acquitting a criminal. 

If $H_0: N = 350$ is true, then the decision to reject $H_0$ is a type I error. We
will never know whether we make a type I error. However, given a particular
decision rule, we can say something about the probability of committing a
type I error. Suppose the decision rule would be “reject $H_0: N = 350$ whenever
$T \le 200$.” With this decision rule the probability of committing a type I
error is $P(T \le 200) = 0.0596$. If we are willing to run the risk of committing
a type I error with probability 0.0596, we could adopt this decision rule. This
would also mean that on the basis of an observed maximum of 200 we would
reject $H_0$ in favor of $H_1: N < 350$.
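
The probability of a type I error can be read as a long-run frequency: if the null hypothesis is true and the experiment is repeated many times, it is the fraction of repetitions in which the decision rule rejects. The following Python sketch illustrates this by simulation. The function names, the number of repetitions, and the random seed are arbitrary choices made for this illustration.

```python
import random
from math import comb

def exact_type_i_error(c, N=350, n=5):
    """P(T <= c) under H0: N tanks, T = max of n serial numbers."""
    return comb(c, n) / comb(N, n)

def simulated_type_i_error(c, N=350, n=5, trials=20000, seed=1):
    """Fraction of simulated samples under H0 for which the rule rejects."""
    rng = random.Random(seed)
    serials = range(1, N + 1)
    # Each trial draws n serial numbers without replacement and applies
    # the rule "reject H0 whenever the maximum is at most c".
    rejections = sum(max(rng.sample(serials, n)) <= c for _ in range(trials))
    return rejections / trials

exact = exact_type_i_error(200)
estimate = simulated_type_i_error(200)
print(f"exact: {exact:.4f}, simulated: {estimate:.4f}")
```

The simulated fraction fluctuates around the exact value, and the fluctuation shrinks as the number of repetitions grows.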


QUICK EXERCISE 25.4 Suppose we adopt the following decision rule about the
null hypothesis: “reject $H_0: N = 350$ whenever $T \le 250$.” Using this decision
rule, what is the probability of committing a type I error?




The question remains what amount of risk one is willing to take to falsely
reject $H_0$, or in courtroom terminology: how small should the p-value be to
reach a conclusion that is “beyond reasonable doubt”? In many situations,
as a rule of thumb 0.05 is used as the level where reasonable doubt begins.
Something happening with probability less than or equal to 0.05 is then viewed
as being too exceptional. However, there is no general rule that specifies how
small the p-value must be to reject $H_0$. There is no way to argue that this
probability should be below 0.10 or 0.18 or 0.009, or anything else.


A possible solution is to solely report the p-value corresponding to the ob- 
served value of the test statistic. This is objective and does not have the 
arbitrariness of a preselected level such as 0.05. An investigator who reports 
the p-value conveys the maximum amount of information contained in the 
dataset and permits all decision makers to choose their own level and make 
their own decision about the null hypothesis. This is especially important 
when there is no justifiable reason for preselecting a particular value for such 
a level. 


25.4 Solutions to the quick exercises 


25.1 One is interested in whether dry drilling is faster than wet drilling.
Hence if we reject $H_0: \mu_1 = \mu_2$, we would like to conclude that the drill time
is smaller for dry drilling than for wet drilling. Since $\mu_1$ and $\mu_2$ represent the
drill time for dry and wet drilling, we should choose $H_1: \mu_1 < \mu_2$.


25.2 The value of $\bar{X}_5$ is at least 3 and if we find a value of $\bar{X}_5$ that is larger
than 348, then at least one of the five numbers must be greater than 350, so
that we immediately know that $H_0$ as well as $H_1$ is false. Hence the possible
values of $\bar{X}_5$ that are relevant for our testing problem are between 3 and 348.
We know from Section 20.1 that $2\bar{X}_5 - 1$ is an unbiased estimator for N,
no matter what the value of N is. This implies that values of $\bar{X}_5$ itself are
centered around $(N + 1)/2$. Hence values close to $351/2 = 175.5$ are in favor
of $H_0$, whereas values close to 3 are in favor of $H_1$. Values close to 348 are
against $H_0$, but also against $H_1$. See Figure 25.3.


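[Figure: axis of possible values of $\bar{X}_5$ with 3, 175.5, and 348 marked. From left to right the regions are labeled "values in favor of $H_1$", "values in favor of $H_0$", and "values against both $H_0$ and $H_1$".]


Fig. 25.3. Values of the test statistic $\bar{X}_5$.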


25.3 The p-value corresponding to 61 is now equal to

\[ P(T \le 61) = \frac{61}{80} \cdot \frac{60}{79} \cdots \frac{57}{76} = 0.2475. \]

If $H_0$ is true, then in 24.75% of the time one will observe a value T less than
or equal to 61. Such values are not exceptionally small for T under $H_0$, and
therefore the evidence that the value 61 bears against $H_0$ is pretty weak. We
cannot reject $H_0$ beyond reasonable doubt.


25.4 The type I error associated with the decision rule occurs if N = 350
($H_0$ is true) and $t \le 250$ (reject $H_0$). The probability that this happens is

\[ P(T \le 250) = \frac{250}{350} \cdot \frac{249}{349} \cdots \frac{246}{346} = 0.1838. \]


25.5 Exercises 


25.1 In a study about train delays in The Netherlands one was interested in
whether arrival delays of trains exhibit more variation during rush hours than
during quiet hours. The observed arrival delays during rush hours are modeled
as realizations of a random sample from a distribution with variance $\sigma_1^2$,
and similarly the observed arrival delays during quiet hours correspond to a
distribution with variance $\sigma_2^2$. One tests the null hypothesis $H_0: \sigma_1 = \sigma_2$.
What do you choose as the alternative hypothesis?


25.2 On average, the number of babies born in Cleveland, Ohio, in the
month of September is 1472. On January 26, 1977, the city was immobilized
by a blizzard. Nine months later, in September 1977, the recorded number of
births was 1718. Can the increase of 246 be attributed to chance? To investigate
this, the number of births in the month of September is modeled by a
Poisson random variable with parameter $\mu$, and we test $H_0: \mu = 1472$. What
would you choose as the alternative hypothesis?


25.3 Recall Exercise 17.9 about black cherry trees. The scatterplot of y (volume)
versus $x = d^2 h$ (squared diameter times height) seems to indicate that
the regression line $y = \alpha + \beta x$ runs through the origin. One wants to investigate
whether this is true by means of a testing problem. Formulate a null
hypothesis and alternative hypothesis in terms of (one of) the parameters $\alpha$
and $\beta$.


25.4 Consider the example from Section 4.4 about the number of cycles
up to pregnancy of smoking and nonsmoking women. Suppose the observed
number of cycles are modeled as realizations of random samples from geometric
distributions. Let $p_1$ be the parameter of the geometric distribution
corresponding to smoking women and $p_2$ be the parameter for the nonsmoking
women. We are interested in whether $p_1$ is different from $p_2$, and we
investigate this by testing $H_0: p_1 = p_2$ against $H_1: p_1 \ne p_2$.


a. If the data are as given in Exercise 17.5, what would you choose as a test 
statistic? 




b. What would you choose as a test statistic, if you were given the extra 
knowledge as in Table 21.1? 

c. Suppose we are interested in whether smoking women are less likely to get 
pregnant than nonsmoking women. What is the appropriate alternative 
hypothesis in this case? 


25.5 Suppose a dataset is a realization of a random sample $X_1, X_2, \ldots, X_n$
from a uniform distribution on $[0, \theta]$, for some (unknown) $\theta > 0$. We test
$H_0: \theta = 5$ versus $H_1: \theta \ne 5$.


a. We take $T_1 = \max\{X_1, X_2, \ldots, X_n\}$ as our test statistic. Specify what
the (relevant) possible values are for $T_1$ and which are in favor of $H_0$ and
which are in favor of $H_1$. For instance, make a picture like Figure 25.1.


b. Same as a, but now for test statistic $T_2 = |2\bar{X}_n - 5|$.


25.6 To test a certain null hypothesis $H_0$ one uses a test statistic T with
a continuous sampling distribution. One agrees that $H_0$ is rejected if one
observes a value t of the test statistic for which (under $H_0$) the right tail
probability P(T > t) is smaller than or equal to 0.05. Given below are different
values t and a corresponding left or right tail probability (under $H_0$). Specify
for each case what the p-value is, if possible, and whether we should reject $H_0$.


a. t = 2.34 and P(T > 2.34) = 0.23.
b. t = 2.34 and P(T < 2.34) = 0.23.
c. t = 0.03 and P(T > 0.03) = 0.968.
d. t = 1.07 and P(T < 1.07) = 0.981.
e. t = 1.07 and P(T < 2.34) = 0.01.
f. t = 2.34 and P(T < 1.07) = 0.981.
g. t = 2.34 and P(T < 1.07) = 0.800.


25.7 (Exercise 25.2 continued). The number of births in September is modeled
by a Poisson random variable T with parameter $\mu$, which represents the
expected number of births. Suppose that one uses T to test the null hypothesis
$H_0: \mu = 1472$ and that one decides to reject $H_0$ on the basis of observing
the value t = 1718.


a. In which direction do values of T provide evidence against $H_0$ (and in
favor of $H_1$)?

b. Compute the p-value corresponding to t = 1718, where you may use the
fact that the distribution of T can be approximated by an $N(\mu, \mu)$ distribution.


25.8 Suppose we want to test the null hypothesis that our dataset is a realization
of a random sample from a standard normal distribution. As test statistic
we use the Kolmogorov-Smirnov distance between the empirical distribution
function $F_n$ of the data and the distribution function $\Phi$ of the standard normal:

\[ T = \sup_{a \in \mathbb{R}} |F_n(a) - \Phi(a)|. \]

What are the possible values of T and in which direction do values of T deviate
from the null hypothesis?


25.9 Recall the example from Section 18.3, where we investigated whether the
software data are exponential by means of the Kolmogorov-Smirnov distance
between the empirical distribution function $F_n$ of the data and the estimated
exponential distribution function:

\[ T_{ks} = \sup_{a \in \mathbb{R}} |F_n(a) - (1 - e^{-\hat{\Lambda} a})|. \]

For the data we found $t_{ks} = 0.176$. By means of a new parametric bootstrap
we simulated 100000 realizations of $T_{ks}$ and found that all of them are smaller
than 0.176. What can you say about the p-value corresponding to 0.176?


25.10 Consider the coal data from Table 23.1, where 23 gross calorific value
measurements are listed for Osterfeld coal coded 262DE27. We modeled this
dataset as a realization of a random sample from a normal distribution with
expectation $\mu$ unknown and standard deviation 0.1 MJ/kg. We are planning
to buy a shipment if the gross calorific value exceeds 23.75 MJ/kg. In order
to decide whether this is sensible, we test the null hypothesis $H_0: \mu = 23.75$
with test statistic $\bar{X}_n$.


a. What would you choose as the alternative hypothesis?


b. For the dataset $\bar{x}_n$ is 23.788. Compute the corresponding p-value, using
that $\bar{X}_n$ has an $N(23.75, (0.1)^2/23)$ distribution under the null hypothesis.


25.11 One is given a number t, which is the realization of a random variable
T with an $N(\mu, 1)$ distribution. To test $H_0: \mu = 0$ against $H_1: \mu \ne 0$,
one uses T as the test statistic. One decides to reject $H_0$ in favor of $H_1$ if
$|t| \ge 2$. Compute the probability of committing a type I error.


26 


Testing hypotheses: elaboration 


In the previous chapter we introduced the setup for testing a null hypothesis
against an alternative hypothesis using a test statistic T. The notions of type I
error and type II error were introduced. A type I error occurs when we falsely
reject $H_0$ on the basis of the observed value of T, whereas a type II error
occurs when we falsely do not reject $H_0$. The decision to reject $H_0$ or not was
based on the size of the p-value. In this chapter we continue the introduction
of basic concepts of testing hypotheses, such as significance level and critical
region, and investigate the probability of committing a type II error.


26.1 Significance level 


As mentioned in the previous chapter, there is no general rule that specifies a
level below which the p-value is considered exceptionally small. However, there
are situations where this level is set a priori, and the question is: which values
of the test statistic should then lead to rejection of $H_0$? To illustrate this, consider
the following example. The speed limit on freeways in The Netherlands
is 120 kilometers per hour. A device next to freeway A2 between Amsterdam
and Utrecht measures the speed of passing vehicles. Suppose that the device
is designed in such a way that it conducts three measurements of the speed
of a passing vehicle, modeled by a random sample $X_1, X_2, X_3$. On the basis
of the value of the average $\bar{X}_3$, the driver is either fined for speeding or not.
For what values of $\bar{X}_3$ should we fine the driver, if we allow that 5% of the
drivers are fined unjustly?


Let us rephrase things in terms of a testing problem. Each measurement can 
be thought of as 


measurement = true speed + measurement error. 


Suppose for the moment that the measuring device is carefully calibrated, so
that the measurement error is modeled by a random variable with mean zero
and known variance $\sigma^2$, say $\sigma^2 = 4$. Moreover, in physical experiments such as
this one, the measurement error is often modeled by a random variable with a
normal distribution. In that case, the measurements $X_1, X_2, X_3$ are modeled
by a random sample from an $N(\mu, 4)$ distribution, where the parameter $\mu$
represents the true speed of the passing vehicle. Our testing problem can now
be formulated as testing


\[ H_0: \mu = 120 \quad \text{against} \quad H_1: \mu > 120, \]

with test statistic

\[ T = \frac{X_1 + X_2 + X_3}{3} = \bar{X}_3. \]


Since sums of independent normal random variables again have a normal distribution
(see Remark 11.2), it follows that $\bar{X}_3$ has an $N(\mu, 4/3)$ distribution.
In particular, the distribution of $T = \bar{X}_3$ is centered around $\mu$ no matter what
the value of $\mu$ is. Values of T close to 120 are therefore in favor of $H_0$. Values of
T that are far from 120 are considered as strong evidence against $H_0$. Values
much larger than 120 suggest that $\mu > 120$ and are therefore in favor of $H_1$.
Values much smaller than 120 suggest that $\mu < 120$. They also constitute
evidence against $H_0$, but even stronger evidence against $H_1$. Thus we reject
$H_0$ in favor of $H_1$ only for values of T larger than 120. See also Figure 26.1.


[Figure: axis of possible values of T, with 120 marked.]


Fig. 26.1. Possible values of $T = \bar{X}_3$.


Rejection of $H_0$ in favor of $H_1$ corresponds to fining the driver for speeding.
Unjustly fining a driver corresponds to falsely rejecting $H_0$, i.e., committing
a type I error. Since we allow 5% of the drivers to be fined unjustly, we are
dealing with a testing problem where the probability of committing a type I
error is set a priori at 0.05. The question is: for which values of T should
we reject $H_0$? The decision rule for rejecting $H_0$ should be such that the
corresponding probability of committing a type I error is 0.05. The value 0.05
is called the significance level.


SIGNIFICANCE LEVEL. The significance level is the largest acceptable
probability of committing a type I error and is denoted by $\alpha$,
where $0 < \alpha < 1$.


We speak of “performing the test at level $\alpha$,” as well as “rejecting $H_0$ in
favor of $H_1$ at level $\alpha$.” In our example we are testing $H_0: \mu = 120$ against
$H_1: \mu > 120$ at level 0.05.




QUICK EXERCISE 26.1 Suppose that in the freeway example $H_0: \mu = 120$ is
rejected in favor of $H_1: \mu > 120$ at level $\alpha = 0.05$. Will it necessarily be
rejected at level $\alpha = 0.01$? On the other hand, suppose that $H_0: \mu = 120$
is rejected in favor of $H_1: \mu > 120$ at level $\alpha = 0.01$. Will it necessarily be
rejected at level $\alpha = 0.05$?


Let us continue with our example and determine for which values of $T = \bar{X}_3$
we should reject $H_0$ at level $\alpha = 0.05$ in favor of $H_1: \mu > 120$. Suppose
we decide to fine each driver whose recorded average speed is 121 or more,
i.e., we reject $H_0$ whenever $T \ge 121$. Then how large is the probability of a
type I error $P(T \ge 121)$? When $H_0: \mu = 120$ is true, then $T = \bar{X}_3$ has an
$N(120, 4/3)$ distribution, so that by the change-of-units rule for the normal
distribution (see page 106), the random variable

$$Z = \frac{T - 120}{2/\sqrt{3}}$$

has an $N(0,1)$ distribution. This implies that

$$P(T \ge 121) = P\left(\frac{T - 120}{2/\sqrt{3}} \ge \frac{121 - 120}{2/\sqrt{3}}\right) = P(Z \ge 0.87).$$

From Table B.1, we find $P(Z \ge 0.87) = 0.1922$, which means that the
probability of a type I error is greater than the significance level $\alpha = 0.05$. Since
this level was defined as the largest acceptable probability of a type I error,
we do not reject $H_0$ for a recorded average of 121. Similarly, if we decide to
reject $H_0$ whenever we record an average of 122 or more, the probability of a
type I error equals 0.0416 (check this). This is smaller than $\alpha = 0.05$, so in
that case we reject $H_0$. The boundary case is the value $c$ that satisfies
$P(T \ge c) = 0.05$. To find $c$, we must solve

$$P\left(Z \ge \frac{c - 120}{2/\sqrt{3}}\right) = 0.05.$$

From Table B.2 we have that $z_{0.05} = t_{\infty,0.05} = 1.645$, so that we find

$$\frac{c - 120}{2/\sqrt{3}} = 1.645,$$

which leads to

$$c = 120 + 1.645 \cdot \frac{2}{\sqrt{3}} = 121.9.$$

Hence, if we set the significance level $\alpha$ at 0.05, we should reject $H_0: \mu = 120$
in favor of $H_1: \mu > 120$ whenever $T \ge 121.9$. For our freeway example this
means that if the average recorded speed of a passing vehicle is greater than
or equal to 121.9, then the driver is fined for speeding. With this decision rule,
at most 5% of the drivers get fined unjustly.
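The computation above can be checked numerically. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` as a stand-in for Tables B.1 and B.2; the names `null_dist`, `type_one_error`, and `critical_value` are illustrative and not part of the text. Because no intermediate rounding takes place, the printed values may differ slightly from the table-based ones.

```python
from math import sqrt
from statistics import NormalDist

MU_0 = 120.0    # value of mu under H0
SIGMA = 2.0     # standard deviation of one measurement (variance 4)
N = 3           # number of measurements per vehicle
ALPHA = 0.05    # significance level

# Distribution of T (the average of three measurements) under H0: N(120, 4/3).
null_dist = NormalDist(mu=MU_0, sigma=SIGMA / sqrt(N))


def type_one_error(threshold):
    """P(T >= threshold | H0) for the rule 'reject H0 when T >= threshold'."""
    return 1.0 - null_dist.cdf(threshold)


# Boundary case: the value c with P(T >= c | H0) = alpha.
critical_value = null_dist.inv_cdf(1.0 - ALPHA)

print(f"P(T >= 121 | H0) = {type_one_error(121):.4f}")
print(f"P(T >= 122 | H0) = {type_one_error(122):.4f}")
print(f"critical value c = {critical_value:.1f}")
```

The threshold 121 gives a type I error probability well above 0.05, the threshold 122 one slightly below it, and the boundary value rounds to 121.9.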




In connection with p-values: the significance level is the level below which
the p-value is sufficiently small to reject $H_0$. Indeed, for any observed value
$t \ge 121.9$ we reject $H_0$, and the p-value for such a $t$ is at most 0.05:

$$P(T \ge t) \le P(T \ge 121.9) = 0.05.$$

We will see more about this relation in the next section.


26.2 Critical region and critical values 


In the freeway example the significance level 0.05 corresponds to the decision
rule “reject $H_0: \mu = 120$ in favor of $H_1: \mu > 120$ whenever $T \ge 121.9$.” The
set $K = [121.9, \infty)$ consisting of values of the test statistic $T$ for which we
reject $H_0$ is called the critical region. The value 121.9, which is the boundary case
between rejecting and not rejecting $H_0$, is called the critical value.


CRITICAL REGION AND CRITICAL VALUES. Suppose we test $H_0$
against $H_1$ at significance level $\alpha$ by means of a test statistic $T$.
The set $K \subset \mathbb{R}$ that corresponds to all values of $T$ for which we
reject $H_0$ in favor of $H_1$ is called the critical region. Values on the
boundary of the critical region are called critical values.


The precise shape of the critical region depends on both the chosen significance
level $\alpha$ and the test statistic $T$ that is used. But it will always be such that
the probability that $T \in K$ satisfies

$$P(T \in K) \le \alpha \quad \text{in the case that } H_0 \text{ is true.}$$

At this point it becomes important to emphasize whether probabilities are
computed under the assumption that $H_0$ is true. With a slight abuse of
notation, we briefly write $P(T \in K \mid H_0)$ for the probability.


Relation with p-values 


If we record average speed $t = 124$, then this value falls in the critical region
$K = [121.9, \infty)$, so that $H_0: \mu = 120$ is rejected in favor of $H_1: \mu > 120$. On
the other hand we can also compute the p-value corresponding to the observed
value 124. Since values of $T$ to the right provide stronger evidence against $H_0$,
the p-value is the following right tail probability

$$P(T \ge 124 \mid H_0) = P\left(\frac{T - 120}{2/\sqrt{3}} \ge \frac{124 - 120}{2/\sqrt{3}}\right) = P(Z \ge 3.46) = 0.0003,$$

which is smaller than the significance level 0.05. This is no coincidence.




In general, suppose that we perform a test at level $\alpha$ using test statistic $T$
and that we have observed $t$ as the value of our test statistic. Then

$$t \in K \iff \text{the p-value corresponding to } t \text{ is less than or equal to } \alpha.$$


Figure 26.2 illustrates this for a testing problem where values of $T$ to the
right provide evidence against $H_0$ and in favor of $H_1$. In that case, the p-value
corresponds to the right tail probability $P(T \ge t \mid H_0)$. The shaded area to the
right of $c_\alpha$ corresponds to $\alpha = P(T \ge c_\alpha \mid H_0)$, whereas the more intensely
shaded area to the right of $t$ represents the p-value. We see that deciding
whether to reject $H_0$ at a given significance level $\alpha$ can be done by comparing
either $t$ with $c_\alpha$ or the p-value with $\alpha$. For this reason the p-value is sometimes
called the observed significance level.
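The equivalence between the two ways of deciding can be illustrated for the freeway example. This is a small sketch, assuming the $N(120, 4/3)$ null distribution from Section 26.1 and Python's standard-library `NormalDist`; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Distribution of T under H0 in the freeway example: N(120, 4/3).
null_dist = NormalDist(mu=120.0, sigma=2.0 / sqrt(3))


def p_value(t):
    """Right tail probability P(T >= t | H0)."""
    return 1.0 - null_dist.cdf(t)


def rejects_by_critical_value(t, alpha):
    """Decide by comparing t with the critical value c_alpha."""
    return t >= null_dist.inv_cdf(1.0 - alpha)


def rejects_by_p_value(t, alpha):
    """Decide by comparing the p-value with alpha."""
    return p_value(t) <= alpha


for alpha in (0.05, 0.01):
    for t in (119.0, 121.0, 122.0, 124.0):
        print(alpha, t,
              rejects_by_critical_value(t, alpha),
              rejects_by_p_value(t, alpha))
```

For every observed value that is not exactly on the boundary, both functions return the same decision, which is the statement of the displayed equivalence.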


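[Figure 26.2: sampling distribution of $T$ under $H_0$, with the critical value $c_\alpha$ and the observed value $t$ marked on the horizontal axis; the critical region is $K = [c_\alpha, \infty)$.]

Fig. 26.2. P-value and critical value.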


The concepts of critical value and p-value have their own merit. The critical
region and the corresponding critical values specify exactly what values of $T$
lead to rejection of $H_0$ at a given level $\alpha$. This can be done even without
obtaining a dataset and computing the value $t$ of the test statistic. The
p-value, on the other hand, represents the strength of the evidence the observed
value $t$ bears against $H_0$. But it does not specify all values of $T$ that lead to
rejection of $H_0$ at a given level $\alpha$.


QUICK EXERCISE 26.2 In our freeway example, we have already computed
the relevant tail probability to decide whether a person with recorded average
speed $t = 124$ gets fined if we set the significance level at 0.05. Suppose the
significance level is set at $\alpha = 0.01$ (we allow 1% of the drivers to get fined
unjustly). Determine whether a person with recorded average speed $t = 124$
gets fined ($H_0: \mu = 120$ is rejected). Furthermore, determine the critical
region in this case.




Sometimes the critical region $K$ can be constructed such that $P(T \in K \mid H_0)$ is
exactly equal to $\alpha$, as in the freeway example. However, when the distribution
of $T$ is discrete, this is not always possible. This is illustrated by the next
example.


After the introduction of the Euro, Polish mathematicians claimed that the
Belgian 1 Euro coin is not a fair coin (see, for instance, the New Scientist,
January 4, 2002). Suppose we put a 1 Euro coin to the test. We will throw
it ten times and record $X$, the number of heads. Then $X$ has a $\mathrm{Bin}(10, p)$
distribution, where $p$ denotes the probability of heads. We would like to find out
whether $p$ differs from 1/2. Therefore we test

$$H_0: p = \tfrac{1}{2} \ \text{(the coin is fair)} \quad \text{against} \quad H_1: p \ne \tfrac{1}{2} \ \text{(the coin is not fair)}.$$

We use $X$ as the test statistic. When we set the significance level $\alpha$ at 0.05,
for what values of $X$ will we reject $H_0$ and conclude that the coin is not fair?
Let us first find out what values of $X$ are in favor of $H_1$. If $H_0: p = 1/2$ is
true, then $E[X] = 10 \cdot \tfrac{1}{2} = 5$, so that values of $X$ close to 5 are in favor of $H_0$.
Values close to 10 suggest that $p > 1/2$ and values close to 0 suggest that
$p < 1/2$. Hence, both values close to 0 and values close to 10 are in favor of
$H_1: p \ne 1/2$.


[Diagram: the values of $X$ on a line from 0 to 10; values near 0 and values near 10 are in favor of $H_1$.]


This means that we will reject $H_0$ in favor of $H_1$ whenever $X \le c_l$ or $X \ge c_u$.
Therefore, the critical region is the set

$$K = \{0, 1, \ldots, c_l\} \cup \{c_u, \ldots, 9, 10\}.$$

The boundary values $c_l$ and $c_u$ are called left and right critical values. They
must be chosen such that the critical region $K$ is as large as possible and still
satisfies

$$P(X \in K \mid H_0) = P(X \le c_l \mid p = \tfrac{1}{2}) + P(X \ge c_u \mid p = \tfrac{1}{2}) \le 0.05.$$

Here $P(X \ge c_u \mid p = \tfrac{1}{2})$ denotes the probability $P(X \ge c_u)$ computed with $X$
having a $\mathrm{Bin}(10, \tfrac{1}{2})$ distribution. Since we have no preference for rejecting $H_0$
for values close to 0 or close to 10, we divide 0.05 over the two sides, and we
choose $c_l$ as large as possible and $c_u$ as small as possible such that

$$P(X \le c_l \mid p = \tfrac{1}{2}) \le 0.025 \quad \text{and} \quad P(X \ge c_u \mid p = \tfrac{1}{2}) \le 0.025.$$




Table 26.1. Left tail probabilities of the $\mathrm{Bin}(10, \tfrac{1}{2})$ distribution.

 k    $P(X \le k)$       k    $P(X \le k)$
 0    0.00098            6    0.82813
 1    0.01074            7    0.94531
 2    0.05469            8    0.98926
 3    0.17188            9    0.99902
 4    0.37696           10    1.00000
 5    0.62305


The left tail probabilities of the $\mathrm{Bin}(10, \tfrac{1}{2})$ distribution are listed in
Table 26.1. We immediately see that $c_l = 1$ is the largest value such that
$P(X \le c_l \mid p = 1/2) \le 0.025$. Similarly, $c_u = 9$ is the smallest value such that
$P(X \ge c_u \mid p = 1/2) \le 0.025$. Indeed, when $X$ has a $\mathrm{Bin}(10, \tfrac{1}{2})$ distribution,

$$P(X \ge 9) = 1 - P(X \le 8) = 1 - 0.98926 = 0.01074,$$
$$P(X \ge 8) = 1 - P(X \le 7) = 1 - 0.94531 = 0.05469.$$

Hence, if we test $H_0: p = 1/2$ against $H_1: p \ne 1/2$ at level $\alpha = 0.05$, the
critical region is the set $K = \{0, 1, 9, 10\}$. The corresponding type I error is

$$P(X \in K) = P(X \le 1) + P(X \ge 9) = 0.01074 + 0.01074 = 0.02148,$$

which is smaller than the significance level. You may perform ten throws with
your favorite coin and see whether the number of heads falls in the critical
region.
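The search for the left and right critical values can also be carried out by a short program instead of a table lookup. The sketch below assumes the rule described above (divide $\alpha$ equally over both tails and make each tail as large as possible); the helper names `binom_pmf` and `two_sided_critical_region` are illustrative.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X with a Bin(n, p) distribution."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_critical_region(n, p0, alpha):
    """Return (region, type_one_error) with at most alpha/2 in each tail."""
    pmf = [binom_pmf(k, n, p0) for k in range(n + 1)]
    half = alpha / 2

    # Largest c_l such that P(X <= c_l) <= alpha/2 (stays -1 if none exists).
    c_left, cumulative = -1, 0.0
    for k in range(n + 1):
        cumulative += pmf[k]
        if cumulative > half:
            break
        c_left = k

    # Smallest c_u such that P(X >= c_u) <= alpha/2 (stays n + 1 if none exists).
    c_right, cumulative = n + 1, 0.0
    for k in range(n, -1, -1):
        cumulative += pmf[k]
        if cumulative > half:
            break
        c_right = k

    region = list(range(0, c_left + 1)) + list(range(c_right, n + 1))
    type_one_error = sum(pmf[k] for k in region)
    return region, type_one_error


region, type_one_error = two_sided_critical_region(n=10, p0=0.5, alpha=0.05)
print(region, round(type_one_error, 5))
```

For ten throws this reproduces the critical region {0, 1, 9, 10}, with a type I error probability below the nominal level 0.05.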


QUICK EXERCISE 26.3 Recall the tank example where we tested $H_0: N = 350$
against $H_1: N < 350$ by means of the test statistic $T = \max X_i$. Suppose that
we perform the test at level 0.05. Deduce the critical region $K$ corresponding
to level 0.05 from the left tail probabilities given here:

 k                       195      194      193      192      191
 $P(T \le k \mid H_0)$   0.0525   0.0511   0.0498   0.0485   0.0472

Is $P(T \in K \mid H_0) = 0.05$?


One- and two-tailed p-values 


In the Euro coin example, we deviate from $H_0: p = 1/2$ in two directions:
values of $X$ both far to the right and far to the left of 5 are evidence against $H_0$.
Suppose that in ten throws with the 1 Euro coin we recorded $x$ heads. What
would the p-value be corresponding to $x$? The problem is that the direction
in which values of $X$ are at least as extreme as the observed value $x$ depends
on whether $x$ lies to the right or to the left of 5.




At this point there are two natural solutions. One may report the appropriate
left or right tail probability, which corresponds to the direction in which
$x$ deviates from $H_0$. For instance, if $x$ lies to the right of 5, we compute
$P(X \ge x \mid H_0)$. This is called a one-tailed p-value. The disadvantage of
one-tailed p-values is that they are somewhat misleading about the strength of
the evidence that the observed value $x$ bears against $H_0$. In view of the relation
between rejection on the basis of critical values or on the basis of a p-value,
the one-tailed p-value should be compared to $\alpha/2$. On the other hand, since
people are inclined to compare p-values with the significance level $\alpha$ itself,
one could also double the one-tailed p-value and compare this with $\alpha$. This
double-tail probability is called a two-tailed p-value. It doesn’t make much
of a difference, as long as one also reports whether the reported p-value is
one-tailed or two-tailed.

Let us illustrate things by means of the findings by the Polish mathematicians.
They performed 250 throws with a Belgian 1 Euro coin and recorded heads
140 times (see also Exercise 24.2). The question is whether this provides strong
enough evidence against $H_0: p = 1/2$. The observed value 140 is to the right
of 125, the value we would expect if $H_0$ is true. Hence the one-tailed p-value
is $P(X \ge 140)$, where now $X$ has a $\mathrm{Bin}(250, \tfrac{1}{2})$ distribution. By means of the
normal approximation (see page 201), we find

$$P(X \ge 140) = P\left(\frac{X - 125}{\sqrt{1/4}\,\sqrt{250}} \ge \frac{140 - 125}{\sqrt{1/4}\,\sqrt{250}}\right) \approx P(Z \ge 1.90) = 1 - \Phi(1.90) = 0.0287.$$

Therefore the two-tailed p-value is approximately 0.0574, which does not
provide very strong evidence against $H_0$. In fact, the exact two-tailed p-value,
computed by means of statistical software, is 0.066, which is even larger.
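Both the approximate and the exact two-tailed p-value can be computed with a few lines of code. The sketch below uses only the Python standard library and assumes that the exact two-tailed p-value is obtained by doubling the exact right tail probability; the variable names are illustrative. Since the standardized value is not rounded to 1.90 here, the normal approximation comes out marginally different from the table-based 0.0574.

```python
from math import comb, sqrt
from statistics import NormalDist

n, observed, p0 = 250, 140, 0.5

# Normal approximation: standardize with mean n*p0 and variance n*p0*(1 - p0).
mean = n * p0
std_dev = sqrt(n * p0 * (1 - p0))
one_tailed_normal = 1.0 - NormalDist().cdf((observed - mean) / std_dev)

# Exact right tail P(X >= 140) for X ~ Bin(250, 1/2), using integer arithmetic.
one_tailed_exact = sum(comb(n, k) for k in range(observed, n + 1)) / 2**n

print(f"two-tailed p-value (normal approximation): {2 * one_tailed_normal:.4f}")
print(f"two-tailed p-value (exact binomial):       {2 * one_tailed_exact:.4f}")
```

The exact value is larger than the approximation mainly because the approximation above does not include a continuity correction.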


QUICK EXERCISE 26.4 In a Dutch newspaper (De Telegraaf, January 3, 2002)
it was reported that the Polish mathematicians recorded heads 150 times.
What are the one- and two-tailed probabilities in this case? Do they now have
a case?


26.3 Type II error 


As we have just seen, by setting a significance level $\alpha$, we are able to control
the probability of committing a type I error; it will at most be $\alpha$. For instance,
let us return to the freeway example and suppose that we adopt the decision
rule to fine the driver for speeding if her average observed speed is at least
121.9, i.e.,

reject $H_0: \mu = 120$ in favor of $H_1: \mu > 120$ whenever $T = \bar{X}_3 \ge 121.9$.




From Section 26.1 we know that with this decision rule, the probability of a
type I error is 0.05. What is the probability of committing a type II error?
This corresponds to the percentage of drivers whose true speed is above 120
but who do not get fined because their recorded average speed is below 121.9.
For instance, suppose that a car passes at true speed $\mu = 125$. A type II error
occurs when $T < 121.9$, and since $T = \bar{X}_3$ has an $N(125, 4/3)$ distribution,
the probability that this happens is

$$P(T < 121.9 \mid \mu = 125) = P\left(\frac{T - 125}{2/\sqrt{3}} < \frac{121.9 - 125}{2/\sqrt{3}}\right) = \Phi(-2.68) = 0.0036.$$

This looks promising, but now consider a vehicle passing at true speed $\mu = 123$.
The probability of committing a type II error in this case is

$$P(T < 121.9 \mid \mu = 123) = P\left(\frac{T - 123}{2/\sqrt{3}} < \frac{121.9 - 123}{2/\sqrt{3}}\right) = \Phi(-0.95) = 0.1711.$$

Hence 17.11% of all drivers that pass at speed $\mu = 123$ will not get fined. In
Figure 26.3 the last situation is illustrated. The curve on the left represents the
probability density of the $N(120, 4/3)$ distribution, which is the distribution
of $T$ under the null hypothesis. The shaded area on the right of 121.9 represents
the probability of committing a type I error

$$P(T \ge 121.9 \mid \mu = 120) = 0.05.$$

The curve on the right is the probability density of the $N(123, 4/3)$ distribution,
which is the distribution of $T$ under the alternative $\mu = 123$. The shaded
area on the left of 121.9 represents the probability of a type II error


$$P(T < 121.9 \mid \mu = 123) = 0.1711.$$

[Figure 26.3: probability densities of the sampling distribution of $T$ when $H_0$ is true (left curve) and when $\mu = 123$ (right curve), with 120 and 121.9 marked on the horizontal axis; to the left of 121.9 we do not reject $H_0$, to the right we reject $H_0$.]

Fig. 26.3. Type I and type II errors in the freeway example.


Shifting $\mu$ further to the right will result in a smaller probability of a type II
error. However, shifting $\mu$ toward the value 120 leads to a larger probability
of a type II error. In fact it can be arbitrarily close to 0.95.

The previous example illustrates that the probability of committing a type II
error depends on the actual value of $\mu$ in the alternative hypothesis $H_1: \mu > 120$.
The closer $\mu$ is to 120, the higher the probability of a type II error will
be. In contrast with the probability of a type I error, which is always at most
$\alpha$, the probability of a type II error may be arbitrarily close to $1 - \alpha$. This is
illustrated in the next quick exercise.
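How the probability of a type II error varies with the true value of $\mu$ can be tabulated directly. The following sketch assumes the decision rule with the rounded critical value 121.9 and the standard deviation $2/\sqrt{3}$ of $T$ from the freeway example; the function name `type_two_error` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

CRITICAL_VALUE = 121.9             # reject H0 when T >= 121.9
STANDARD_ERROR = 2.0 / sqrt(3)     # standard deviation of T


def type_two_error(mu):
    """P(T < 121.9 | mu): probability of not rejecting H0 at true speed mu."""
    return NormalDist(mu=mu, sigma=STANDARD_ERROR).cdf(CRITICAL_VALUE)


for mu in (125.0, 123.0, 122.0, 121.0, 120.1, 120.001):
    print(f"mu = {mu:8.3f}   P(type II error) = {type_two_error(mu):.4f}")
```

The probabilities increase as $\mu$ approaches 120 and level off near 0.95. Because the code does not round the standardized values to two decimals, the last digit can differ from the table-based results in the text.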


QUICK EXERCISE 26.5 What is the probability of a type II error in the freeway
example if $\mu = 120.1$?


26.4 Relation with confidence intervals 


When testing $H_0: \mu = 120$ against $H_1: \mu > 120$ at level 0.05 in the freeway
example, the critical value was obtained by the formula

$$c_{0.05} = 120 + 1.645 \cdot \frac{2}{\sqrt{3}}.$$

On the other hand, using that $\bar{X}_3$ has an $N(\mu, 4/3)$ distribution, a 95% lower
confidence bound for $\mu$ in this case can be derived from

$$l_n = \bar{x}_3 - 1.645 \cdot \frac{2}{\sqrt{3}}.$$

Although, at first sight, testing hypotheses and constructing confidence
intervals seem to be two separate statistical procedures, they are in fact intimately
related. In the freeway example, observe that for a given dataset $x_1, x_2, x_3$,

we reject $H_0: \mu = 120$ in favor of $H_1: \mu > 120$ at level 0.05

$$\iff \bar{x}_3 \ge 120 + 1.645 \cdot \frac{2}{\sqrt{3}}$$

$$\iff \bar{x}_3 - 1.645 \cdot \frac{2}{\sqrt{3}} \ge 120$$

$\iff$ 120 is not in the 95% one-sided confidence interval for $\mu$.


This is not a coincidence. In general, the following applies. Suppose that for
some parameter $\theta$ we test $H_0: \theta = \theta_0$. Then

we reject $H_0: \theta = \theta_0$ in favor of $H_1: \theta > \theta_0$ at level $\alpha$

if and only if

$\theta_0$ is not in the $100(1 - \alpha)\%$ one-sided confidence interval for $\theta$.




The same relation holds for testing against $H_1: \theta < \theta_0$, and a similar relation
holds between testing against $H_1: \theta \ne \theta_0$ and two-sided confidence intervals:

we reject $H_0: \theta = \theta_0$ in favor of $H_1: \theta \ne \theta_0$ at level $\alpha$

if and only if

$\theta_0$ is not in the $100(1 - \alpha)\%$ two-sided confidence region for $\theta$.

In fact, one could use these facts to define the $100(1 - \alpha)\%$ confidence region for
a parameter $\theta$ as the set of values $\theta_0$ for which the null hypothesis $H_0: \theta = \theta_0$
is not rejected at level $\alpha$.
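The one-sided relation can be verified numerically for the freeway example. This is a sketch under the assumptions of Section 26.4 (known standard deviation $2/\sqrt{3}$ of $\bar{X}_3$, level 0.05, one-sided interval $(l_n, \infty)$); the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

ALPHA = 0.05
MU_0 = 120.0
STANDARD_ERROR = 2.0 / sqrt(3)

# z_{0.05} times the standard error: about 1.645 * 2 / sqrt(3).
MARGIN = NormalDist().inv_cdf(1.0 - ALPHA) * STANDARD_ERROR


def rejects_null(sample_mean):
    """Test decision: reject H0 when the sample mean is at least c_0.05."""
    return sample_mean >= MU_0 + MARGIN


def lower_confidence_bound(sample_mean):
    """95% lower confidence bound l_n for mu."""
    return sample_mean - MARGIN


def null_value_in_interval(sample_mean):
    """Is 120 inside the one-sided confidence interval (l_n, infinity)?"""
    return MU_0 > lower_confidence_bound(sample_mean)


for sample_mean in (119.0, 121.0, 121.8, 122.0, 124.0):
    print(sample_mean, rejects_null(sample_mean),
          null_value_in_interval(sample_mean))
```

In every row exactly one of the two printed booleans is true: the null hypothesis is rejected precisely when 120 falls outside the confidence interval.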


It should be emphasized that these relations only hold if the random variable
that is used to construct the confidence interval relates appropriately to the
test statistic. For instance, the preceding relations do not hold if on the one
hand, we construct a confidence interval for the parameter $\mu$ of an $N(\mu, \sigma^2)$
distribution by means of the studentized mean $(\bar{X}_n - \mu)/(S_n/\sqrt{n})$, and on the
other hand, use the sample median $\mathrm{Med}_n$ to test a null hypothesis for $\mu$.


26.5 Solutions to the quick exercises 


26.1 In the first situation, we reject at significance level $\alpha = 0.05$, which
means that the probability of committing a type I error is at most 0.05. This
does not necessarily mean that this probability will also be less than or equal to
0.01. Therefore with this information we cannot know whether we also reject
at level $\alpha = 0.01$. In the reversed situation, if we reject at level $\alpha = 0.01$, then
the probability of committing a type I error is at most 0.01, and is therefore
also smaller than 0.05. This means that we also reject at level $\alpha = 0.05$.


26.2 To decide whether we should reject $H_0: \mu = 120$ at level 0.01, we could
compute $P(T \ge 124 \mid H_0)$ and compare this with 0.01. We have already seen
that $P(T \ge 124 \mid H_0) = 0.0003$. This is (much) smaller than the significance
level $\alpha = 0.01$, so we should reject.

The critical region is $K = [c, \infty)$, where we must solve $c$ from

$$P\left(Z \ge \frac{c - 120}{2/\sqrt{3}}\right) = 0.01.$$

Since $z_{0.01} = 2.326$, this means that $c = 120 + 2.326 \cdot (2/\sqrt{3}) = 122.7$.


26.3 The critical region is of the form $K = \{5, 6, \ldots, c\}$, where the critical
value $c$ is the largest value for which $P(T \le c \mid H_0)$ is still less than or
equal to 0.05. From the table we immediately see that $c = 193$ and that
$P(T \in K \mid H_0) = P(T \le 193 \mid H_0) = 0.0498$, which is not equal to 0.05.




26.4 By means of the normal approximation, for the one-tailed p-value we
find

$$P(X \ge 150) = P\left(\frac{X - 125}{\sqrt{1/4}\,\sqrt{250}} \ge \frac{150 - 125}{\sqrt{1/4}\,\sqrt{250}}\right) \approx P(Z \ge 3.16) = 1 - \Phi(3.16) = 0.0008.$$

The two-tailed p-value is 0.0016. This is a lot smaller than the two-tailed
p-value 0.0574, corresponding to 140 heads. It seems that with 150 heads the
mathematicians would have a case; the Belgian Euro coin would then appear
not to be fair.


26.5 The probability of a type II error is

$$P(T < 121.9 \mid \mu = 120.1) = P\left(\frac{T - 120.1}{2/\sqrt{3}} < \frac{121.9 - 120.1}{2/\sqrt{3}}\right) = \Phi(1.56) = 0.9406.$$


26.6 Exercises 


26.1 Polygraphs that are used in criminal investigations are supposed to
indicate whether a person is lying or telling the truth. However the procedure
is not infallible, as is illustrated by the following example. An experienced
polygraph examiner was asked to make an overall judgment for each of a
total of 280 records, of which 140 were from guilty suspects and 140 from
innocent suspects. The results are listed in Table 26.2. We view each judgment
as a problem of hypothesis testing, with the null hypothesis corresponding to
“suspect is innocent” and the alternative hypothesis to “suspect is guilty.”
Estimate the probabilities of a type I error and a type II error that apply to
this polygraph method on the basis of Table 26.2.


26.2 Consider the testing problem in Exercise 25.11. Compute the probability
of committing a type II error if the true value of $\mu$ is 1.


26.3 ⊡ One generates a number $x$ from a uniform distribution on the interval
$[0, \theta]$. One decides to test $H_0: \theta = 2$ against $H_1: \theta \ne 2$ by rejecting $H_0$ if
$x \le 0.1$ or $x \ge 1.9$.

a. Compute the probability of committing a type I error.

b. Compute the probability of committing a type II error if the true value
of $\theta$ is 2.5.


Table 26.2. Examiners and suspects.

                               Suspect’s true status
                               Innocent      Guilty
 Examiner’s     Acquitted        131           15
 assessment     Convicted          9          125

Source: F.S. Horvath and J.E. Reid. The reliability of polygraph examiner
diagnosis of truth and deception. Journal of Criminal Law, Criminology,
and Police Science, 62(2):276–281, 1971.

26.4 To investigate the hypothesis that a horse’s chances of winning an
eight-horse race on a circular track are affected by its position in the starting lineup,
the starting position of each of 144 winners was recorded ([30]). It turned out
that 29 of these winners had starting position one (closest to the rail on the
inside track). We model the number of winners with starting position one by
a random variable $T$ with a $\mathrm{Bin}(144, p)$ distribution. We test the hypothesis
$H_0: p = 1/8$ against $H_1: p > 1/8$ at level $\alpha = 0.01$ with $T$ as test statistic.


a. Argue whether the test procedure involves a right critical value, a left 
critical value, or both. 

b. Use the normal approximation to compute the critical value(s) correspond- 
ing to a = 0.01, determine the critical region, and report your conclusion 
about the null hypothesis. 


26.5 ⊞ Recall Exercises 23.5 and 24.8 about the 1500 m speed-skating results
in the 2002 Winter Olympic Games. The number of races won by skaters
starting in the outer lane is modeled by a random variable $X$ with a $\mathrm{Bin}(23, p)$
distribution. The question of whether there is an outer lane advantage was
investigated in Exercise 24.8 by means of constructing confidence intervals
using the normal approximation. In this exercise we examine this question by
testing the null hypothesis $H_0: p = 1/2$ against $H_1: p > 1/2$ using $X$ as the
test statistic. The distribution of $X$ under $H_0$ is given in Table 26.3. Out of
23 completed races, 15 were won by skaters starting in the outer lane.

a. Compute the p-value corresponding to $x = 15$ and report your conclusion
if we perform the test at level 0.05. Does your conclusion agree with the
confidence interval you found for $p$ in Exercise 24.8 b?

b. Determine the critical region corresponding to significance level $\alpha = 0.05$.

c. Compute the probability of committing a type I error if we base our
decision rule on the critical region determined in b.




Table 26.3. Left tail probabilities for the $\mathrm{Bin}(23, \tfrac{1}{2})$ distribution.

 k   $P(X \le k)$     k   $P(X \le k)$     k   $P(X \le k)$
 0   0.0000           8   0.1050          16   0.9827
 1   0.0000           9   0.2024          17   0.9947
 2   0.0000          10   0.3388          18   0.9987
 3   0.0002          11   0.5000          19   0.9998
 4   0.0013          12   0.6612          20   1.0000
 5   0.0053          13   0.7976          21   1.0000
 6   0.0173          14   0.8950          22   1.0000
 7   0.0466          15   0.9534          23   1.0000


d. Use the normal approximation to determine the probability of committing 
a type II error for the case p = 0.6, if we base our decision rule on the 
critical region determined in b. 


26.6 ⊡ Consider Exercises 25.2 and 25.7. One decides to test $H_0: \mu = 1472$
against $H_1: \mu > 1472$ at level $\alpha = 0.05$ on the basis of the recorded value
1718 of the test statistic $T$.

a. Argue whether the test procedure involves a right critical value, a left
critical value, or both.

b. Use the fact that the distribution of $T$ can be approximated by an $N(\mu, 1)$
distribution to determine the critical value(s) and the critical region, and
report your conclusion about the null hypothesis.


26.7 A random sample $X_1, X_2$ is drawn from a uniform distribution on the
interval $[0, \theta]$. We wish to test $H_0: \theta = 1$ against $H_1: \theta < 1$ by rejecting if
$X_1 + X_2 \le c$. Find the value of $c$ and the critical region that correspond to a
level of significance 0.05.
Hint: use Exercise 11.5.


26.8 ⊞ This exercise is meant to illustrate that the shape of the critical region
is not necessarily similar to the type of alternative hypothesis. The type of
alternative hypothesis and the test statistic used determine the shape of the
critical region.

Suppose that $X_1, X_2, \ldots, X_n$ form a random sample from an $\mathrm{Exp}(\lambda)$
distribution, and we test $H_0: \lambda = 1$ with test statistics $T = \bar{X}_n$ and $T' = e^{-\bar{X}_n}$.

a. Suppose we test the null hypothesis against $H_1: \lambda > 1$. Determine for
both test procedures whether they involve a right critical value, a left
critical value, or both.

b. Same question as in part a, but now test against $H_1: \lambda \ne 1$.




26.9 ⊞ Similar to Exercise 26.8, but with a random sample $X_1, X_2, \ldots, X_n$
from an $N(\mu, 1)$ distribution. We test $H_0: \mu = 0$ with test statistics $T = (\bar{X}_n)^2$
and $T' = 1/\bar{X}_n$.

a. Suppose that we test the null hypothesis against $H_1: \mu \ne 0$. Determine
the shape of the critical region for both test procedures.

b. Same question as in part a, but now test against $H_1: \mu > 0$.


27 


The t-test 


In many applications the quantity of interest can be represented by the
expectation of the model distribution. In some of these applications one wants
to know whether this expectation deviates from some a priori specified value.
This can be investigated by means of a statistical test, known as the t-test.
We consider this test both under the assumption that the model distribution
is normal and without the assumption of normality. Furthermore, we discuss a
similar test for the slope and the intercept in a simple linear regression model.


27.1 Monitoring the production of ball bearings 


Production lines in a large industrial corporation are set to produce a
specific type of steel ball bearing with a diameter of 1 millimeter. In order to
check the performance of the production lines, a number of ball bearings are
picked at the end of the day and their diameters are measured. Suppose we
observe 20 diameters of ball bearings from the production lines, which are listed
in Table 27.1. The average diameter is $\bar{x}_{20} = 1.03$ millimeter. This clearly
deviates from the target value 1, but the question is whether the difference
can be attributed to chance or whether it is large enough to conclude that
the production line is producing ball bearings with a wrong diameter. To
answer this question, we model the dataset as a realization of a random sample
$X_1, X_2, \ldots, X_{20}$ from a probability distribution with expected value $\mu$. The
parameter $\mu$ represents the diameter of ball bearings produced by the production
lines. In order to investigate whether this diameter deviates from 1, we
test the null hypothesis $H_0: \mu = 1$ against $H_1: \mu \ne 1$.

Table 27.1. Diameters of ball bearings.

1.018 1.009 1.042 1.053 0.969 1.002 0.988 1.019 1.062 1.032
1.072 0.977 1.062 1.044 1.069 1.029 0.979 1.096 1.079 0.999


This example illustrates a situation that often occurs: the data $x_1, x_2, \ldots, x_n$
are a realization of a random sample $X_1, X_2, \ldots, X_n$ from a distribution with
expectation $\mu$, and we want to test whether $\mu$ equals an a priori specified value,
say $\mu_0$. According to the law of large numbers, $\bar{X}_n$ is close to $\mu$ for large $n$.
This suggests a test statistic based on $\bar{X}_n - \mu_0$; realizations of $\bar{X}_n - \mu_0$ close
to zero are in favor of the null hypothesis. Does $\bar{X}_n - \mu_0$ suffice as a test
statistic?

In our example, $\bar{x}_n - \mu_0 = 1.03 - 1 = 0.03$. Should we interpret this as small?
First, note that under the null hypothesis $E[\bar{X}_n - \mu_0] = \mu - \mu_0 = 0$. Now, if
$\bar{X}_n - \mu_0$ had standard deviation 1, then the value 0.03 would be within one
standard deviation of $E[\bar{X}_n - \mu_0]$. The “$\mu \pm$ a few $\sigma$” rule on page 185 then
suggests that the value 0.03 is not exceptional; it must be seen as a small
deviation. On the other hand, if $\bar{X}_n - \mu_0$ has standard deviation 0.001, then
the value 0.03 is 30 standard deviations away from $E[\bar{X}_n - \mu_0]$. According to
the “$\mu \pm$ a few $\sigma$” rule this is very exceptional; the value 0.03 must be seen
as a large deviation. The next quick exercise provides a concrete example.


QUICK EXERCISE 27.1 Suppose that $\bar{X}_n$ is a normal random variable with
expectation 1 and variance 1. Determine $P(\bar{X}_n - 1 \ge 0.03)$. Find the same
probability, but for the case where the variance is $(0.01)^2$.


This discussion illustrates that we must standardize $\bar{X}_n - \mu_0$ to incorporate
its variation. Recall that

$$\mathrm{Var}(\bar{X}_n - \mu_0) = \mathrm{Var}(\bar{X}_n) = \frac{\sigma^2}{n},$$

where $\sigma^2$ is the variance of each $X_i$. Hence, standardizing $\bar{X}_n - \mu_0$ means
that we should divide by $\sigma/\sqrt{n}$. Since $\sigma$ is unknown, we substitute the sample
standard deviation $S_n$ for $\sigma$. This leads to the following test statistic for the
null hypothesis $H_0: \mu = \mu_0$:

$$T = \frac{\bar{X}_n - \mu_0}{S_n/\sqrt{n}}.$$

Values of $T$ close to zero are in favor of $H_0: \mu = \mu_0$. Large positive values of
$T$ suggest that $\mu > \mu_0$ and large negative values suggest that $\mu < \mu_0$; both
are evidence against $H_0$.
For the ball bearing data one finds that $s_n = 0.0372$, so that

$$t = \frac{\bar{x}_n - \mu_0}{s_n/\sqrt{n}} = \frac{1.03 - 1}{0.0372/\sqrt{20}} = 3.607.$$

This is clearly different from zero, but the question is whether this difference
is large enough to reject $H_0: \mu = 1$. To answer this question, we need to know


the probability distribution of $T$ under the null hypothesis. Note that under
the null hypothesis $H_0: \mu = \mu_0$, the test statistic

$$T = \frac{\bar{X}_n - \mu_0}{S_n/\sqrt{n}}$$

is the studentized mean (see also Chapter 23)

$$\frac{\bar{X}_n - \mu}{S_n/\sqrt{n}}.$$

Hence, under the null hypothesis, the probability distribution of $T$ is the same
as that of the studentized mean.
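The value of the test statistic for the ball bearing data can be recomputed from Table 27.1. The sketch below uses only the Python standard library; the function name `t_statistic` is illustrative. Because the sample standard deviation is not rounded to 0.0372 before dividing, the result can differ from 3.607 in the last digits.

```python
from math import sqrt
from statistics import mean, stdev

# Diameters from Table 27.1 (in millimeters).
diameters = [
    1.018, 1.009, 1.042, 1.053, 0.969, 1.002, 0.988, 1.019, 1.062, 1.032,
    1.072, 0.977, 1.062, 1.044, 1.069, 1.029, 0.979, 1.096, 1.079, 0.999,
]


def t_statistic(data, mu_0):
    """Studentized distance between the sample mean and mu_0."""
    n = len(data)
    # statistics.stdev uses the divisor n - 1, i.e., the sample standard deviation.
    return (mean(data) - mu_0) / (stdev(data) / sqrt(n))


sample_mean = mean(diameters)
sample_sd = stdev(diameters)
t_value = t_statistic(diameters, mu_0=1.0)

print(f"mean = {sample_mean:.4f}, s_n = {sample_sd:.4f}, t = {t_value:.3f}")
```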


27.2 The one-sample t-test 


The classical assumption is that the dataset is a realization of a random sample
from an $N(\mu, \sigma^2)$ distribution. In that case our test statistic $T$ turns out to
have a t-distribution under the null hypothesis, as we will see later. For this
reason, the test for the null hypothesis $H_0: \mu = \mu_0$ is called the (one-sample)
t-test. Without the assumption of normality, we will use the bootstrap to
approximate the distribution of $T$. For large sample sizes, this distribution
can be approximated by means of the central limit theorem. We start with
the first case.


Normal data 


Suppose that the dataset $x_1, x_2, \ldots, x_n$ is a realization of a random sample
$X_1, X_2, \ldots, X_n$ from an $N(\mu, \sigma^2)$ distribution. Then, according to the rule on
page 349, the studentized mean has a $t(n-1)$ distribution. An immediate
consequence is that, under the null hypothesis $H_0: \mu = \mu_0$, also our test
statistic $T$ has a $t(n-1)$ distribution. Therefore, if we test $H_0: \mu = \mu_0$
against $H_1: \mu \ne \mu_0$ at level $\alpha$, then we must reject the null hypothesis in
favor of $H_1: \mu \ne \mu_0$, if

$$T \le -t_{n-1,\alpha/2} \quad \text{or} \quad T \ge t_{n-1,\alpha/2}.$$

Similar decision rules apply to alternatives $H_1: \mu > \mu_0$ and $H_1: \mu < \mu_0$.
Suppose that in the ball bearing example we test $H_0: \mu = 1$ against
$H_1: \mu \ne 1$ at level $\alpha = 0.05$. From Table B.2 we find $t_{19,0.025} = 2.093$. Hence, we
must reject if $T \le -2.093$ or $T \ge 2.093$. For the ball bearing data we found
$t = 3.607$, which means we reject the null hypothesis at level $\alpha = 0.05$.

Alternatively, one might report the one-tailed p-value corresponding to the
observed value $t$ and compare this with $\alpha/2$. The one-tailed p-value is
either a right or a left tail probability, which must be computed by means
of the $t(n-1)$ distribution. In our ball bearing example the one-tailed
p-value is the right tail probability $P(T \ge 3.607)$. From Table B.2 we see
that this probability is between 0.0005 and 0.0010, which is smaller than
$\alpha/2 = 0.025$ (to be precise, by means of a statistical software package we
found $P(T \ge 3.607) = 0.00094$). The data provide strong enough evidence
against the null hypothesis, so that it seems sensible to adjust the settings of
the production line.


QUICK EXERCISE 27.2 Suppose that the data in Table 27.1 are from two
separate production lines. The first ten measurements have average 1.0194 and
standard deviation 0.0290, whereas the last ten measurements have average
1.0406 and standard deviation 0.0428. Perform the t-test $H_0: \mu = 1$ against
$H_1: \mu \ne 1$ at level $\alpha = 0.01$ for both datasets separately, assuming normality.


Nonnormal data 


Draw a rectangle with height $h$ and width $w$ (let us agree that $w > h$), and
within this rectangle draw a square with sides of length $h$ (see Figure 27.1).
This creates another (smaller) rectangle with horizontal and vertical sides of
lengths $w - h$ and $h$. A large rectangle with a vertical-to-horizontal ratio that
is equal to the horizontal-to-vertical ratio for the small rectangle, i.e.,

$$\frac{h}{w} = \frac{w - h}{h},$$

was called a “golden rectangle” by the ancient Greeks, who often used these in
their architecture. After solving for $h/w$, we obtain that the height-to-width
ratio $h/w$ is equal to the “golden number” $(\sqrt{5} - 1)/2 \approx 0.618$. The data in
Table 27.2 represent corresponding $h/w$ ratios for rectangles used by Shoshoni
Indians to decorate their leather goods. Is it reasonable to assume that they
were also using golden rectangles? We examine this by means of a t-test.

Fig. 27.1. Rectangle with square within.

Table 27.2. Ratios for Shoshoni rectangles.

0.693 0.749 0.654 0.670 0.662 0.672 0.615 0.606 0.690 0.628
0.668 0.611 0.606 0.609 0.601 0.553 0.570 0.844 0.576 0.933

Source: C. Dubois (ed.). Lowie’s selected papers in anthropology, 1960.
© The Regents of the University of California.


The observed ratios are modeled as a realization of a random sample from a
distribution with expectation $\mu$, where the parameter $\mu$ represents the true
esthetic preference for height-to-width ratios of the Shoshoni Indians. We want
to test

$$H_0: \mu = 0.618 \quad\text{against}\quad H_1: \mu \neq 0.618.$$


For the Shoshoni ratios, $\bar{x}_n = 0.6605$ and $s_n = 0.0925$, so that the value
of the test statistic is

$$t = \frac{\bar{x}_n - 0.618}{s_n/\sqrt{n}} = \frac{0.6605 - 0.618}{0.0925/\sqrt{20}} = 2.055.$$

Closer examination of the data indicates that the normal distribution is not
the right model. For instance, by definition the height-to-width ratios $h/w$
are always between 0 and 1. Because some of the data points are also close
to the right boundary 1, the normal distribution is inappropriate. If we cannot
assume a normal model distribution, we can no longer conclude that our test
statistic has a $t(n-1)$ distribution under the null hypothesis.


Since there is no reason to assume any other particular type of distribution
to model the data, we approximate the distribution of $T$ under the null
hypothesis. Recall that this distribution is the same as that of the studentized
mean (see the end of Section 27.1). To approximate its distribution, we use
the empirical bootstrap simulation for the studentized mean, as described
on page 351. We generate 10000 bootstrap datasets and for each bootstrap
dataset $x_1^*, x_2^*, \ldots, x_n^*$, we compute

$$t^* = \frac{\bar{x}_n^* - 0.6605}{s_n^*/\sqrt{n}}.$$

In Figure 27.2 the kernel density estimate and empirical distribution function
are displayed for 10000 bootstrap values $t^*$. Suppose we test $H_0: \mu = 0.618$
against $H_1: \mu \neq 0.618$ at level $\alpha = 0.05$. In the same way as in
Section 23.3, we find the following bootstrap approximations for the critical
values:

$$c_l^* = -3.334 \quad\text{and}\quad c_u^* = 1.644.$$


Fig. 27.2. Kernel density estimate and empirical distribution function of 10000
bootstrap values $t^*$.


Since for the Shoshoni data the value 2.055 of the test statistic is greater
than 1.644, we reject the null hypothesis at level 0.05. Alternatively, we can
also compute a bootstrap approximation of the one-tailed p-value corresponding
to 2.055, which is the right tail probability $P(T \ge 2.055)$. The bootstrap
approximation for this probability is:

$$\frac{\text{number of } t^* \text{ values greater than or equal to } 2.055}{10\,000} = 0.0067.$$

Hence $P(T \ge 2.055) \approx 0.0067$, which is smaller than $\alpha/2 = 0.025$. The
value 2.055 should be considered as exceptionally large, and we reject the null
hypothesis. The esthetic preference for height-to-width ratios of the Shoshoni
Indians differs from that of the ancient Greeks.
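The bootstrap test just described can be sketched as follows. This is an
illustrative Python version using only the standard library, applied to the
ratios of Table 27.2. The function name, the seed, and the simple
order-statistic rule for the critical values are our own choices, so the
simulated numbers will differ somewhat from the values $-3.334$, $1.644$, and
$0.0067$ reported above.

```python
import math
import random

# Height-to-width ratios from Table 27.2.
ratios = [0.693, 0.749, 0.654, 0.670, 0.662, 0.672, 0.615, 0.606, 0.690, 0.628,
          0.668, 0.611, 0.606, 0.609, 0.601, 0.553, 0.570, 0.844, 0.576, 0.933]
MU_0 = 0.618  # value of the expectation under the null hypothesis


def studentized_mean(sample, center):
    """Return (mean - center) / (s / sqrt(n)), or None if s equals zero."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    if var == 0:
        return None
    return (mean - center) / math.sqrt(var / n)


t_obs = studentized_mean(ratios, MU_0)

rng = random.Random(1)             # fixed seed, only for reproducibility
x_bar = sum(ratios) / len(ratios)  # expectation of the empirical distribution
B = 10000
t_star = []
while len(t_star) < B:
    resample = rng.choices(ratios, k=len(ratios))  # draw with replacement
    value = studentized_mean(resample, x_bar)      # center at x_bar, not MU_0
    if value is not None:
        t_star.append(value)
t_star.sort()

c_lower = t_star[int(0.025 * B)]      # approximate 2.5% point of the t* values
c_upper = t_star[int(0.975 * B) - 1]  # approximate 97.5% point of the t* values
p_right = sum(t >= t_obs for t in t_star) / B

print(f"t = {t_obs:.3f}, critical values {c_lower:.3f} and {c_upper:.3f}")
print(f"bootstrap one-tailed p-value: {p_right:.4f}")
```

Note that each bootstrap value is centered at the sample mean 0.6605 rather
than at 0.618, in line with the formula for $t^*$ in the text.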


Large samples 


For large sample sizes the distribution of the studentized mean can be
approximated by a standard normal distribution (see Section 23.4). This means
that for large sample sizes the distribution of the t-test statistic under the
null hypothesis can also be approximated by a standard normal distribution.
To illustrate this, recall the Old Faithful data. Park rangers in Yellowstone
National Park inform the public about the behavior of the geyser, such as the
expected time between successive eruptions and the duration of an eruption.
Suppose they claim that the expected length of an eruption is 4 minutes
(240 seconds). Does this seem likely on the basis of the data from
Section 15.1? We investigate this by testing $H_0: \mu = 240$ against
$H_1: \mu \neq 240$ at level $\alpha = 0.001$, where $\mu$ is the expectation of the
model distribution. The value of the test statistic is

$$t = \frac{\bar{x}_n - 240}{s_n/\sqrt{n}} = \frac{209.3 - 240}{68.48/\sqrt{272}} = -7.39.$$

The one-tailed p-value $P(T \le -7.39)$ can be approximated by $P(Z \le -7.39)$,
where $Z$ has an $N(0,1)$ distribution. From Table B.1 we see that this
probability is smaller than $P(Z \le -3.49) = 0.0002$. This is smaller than
$\alpha/2 = 0.0005$, so we reject the null hypothesis at level 0.001. In fact the
p-value is much smaller: a statistical software package gives
$P(Z \le -7.39) = 7.5 \cdot 10^{-14}$. The data provide overwhelming evidence
against $H_0: \mu = 240$, so that we conclude that the expected length of an
eruption is different from 4 minutes.
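As a rough check of the normal approximation, the computation can be sketched
in Python with the standard library. The variable names are ours, and the
summary values 209.3, 68.48, and 272 are taken from the text. The tail
probability is computed with `math.erfc`, which stays accurate far out in the
tail; the result depends on how many decimals of the test statistic are
retained, so it need not coincide exactly with the value quoted from the
software package.

```python
import math
from statistics import NormalDist


def normal_left_tail(z: float) -> float:
    """P(Z <= z) for a standard normal Z, accurate for very negative z."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


# Summary values for the Old Faithful eruption durations, as given in the text.
n, x_bar, s_n = 272, 209.3, 68.48
mu_0, alpha = 240.0, 0.001

t_obs = (x_bar - mu_0) / (s_n / math.sqrt(n))
p_left = normal_left_tail(t_obs)

# Two-sided test with the normal approximation: critical value z_{alpha/2}.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject = abs(t_obs) >= z_crit

print(f"t = {t_obs:.2f}, one-tailed p-value = {p_left:.2e}")
print(f"critical value = {z_crit:.3f}, reject H0: {reject}")
```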


QUICK EXERCISE 27.3 Compute the critical region K for the test, using the 
normal approximation, and check that t = —7.39 falls in K. 


In fact, if we were to test $H_0: \mu = 240$ against $H_1: \mu < 240$, the p-value
corresponding to $t = -7.39$ is the left tail probability $P(T \le -7.39)$. This
probability is very small, so that we also reject the null hypothesis in favor
of this alternative and conclude that the expected length of an eruption is
smaller than 4 minutes.


27.3 The t-test in a regression setting


Is calcium in your drinking water good for your health? In England and Wales, 
an investigation of environmental causes of disease was conducted. The annual 
mortality rate (percentage of deaths) and the calcium concentration in the 
drinking water supply were recorded for 61 large towns. The data in Table 27.3 
represent the annual mortality rate averaged over the years 1958-1964, and 
the calcium concentration in parts per million. In Figure 27.3 the 61 paired 
measurements are displayed in a scatterplot. The scatterplot shows a slight 
downward trend, which suggests that higher concentrations of calcium lead 
to lower mortality rates. The question is whether this is really the case or if 
the slight downward trend should be attributed to chance. 


To investigate this question we model the mortality data by means of a simple 
linear regression model with normally distributed errors, with the mortality 
rate as the dependent variable y and the calcium concentration as the inde- 
pendent variable x: 


$$Y_i = \alpha + \beta x_i + U_i \quad\text{for } i = 1, 2, \ldots, 61,$$


where $U_1, U_2, \ldots, U_{61}$ is a random sample from an $N(0, \sigma^2)$
distribution. The parameter $\beta$ represents the change of the mortality rate if
we increase the calcium concentration by one unit. We test the null hypothesis
$H_0: \beta = 0$ (calcium has no effect on the mortality rate) against
$H_1: \beta < 0$ (higher concentration of calcium reduces the mortality rate).


This example illustrates the general situation, where the dataset


$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$$




Table 27.3. Mortality data. 


Rate  Calcium    Rate  Calcium    Rate  Calcium    Rate  Calcium

1247    105      1466      5      1299     78      1359     84
1392     73      1307     78      1254     96      1318    122
1260     21      1096    138      1402     37      1309     59
1259    133      1175    107      1486      5      1456     90
1236    101      1369     68      1257     50      1527     60
1627     53      1486    122      1485     81      1519     NA
1581     14      1625     13      1668     17      1800     NA
1609     18      1558     10      1807     15      1637     NA
1755     NA      1491     20      1555     39      1428     NA
1723     NA      1379     94      1742      8      1574     NA
1569     NA      1591     16      1772     15      1828     NA
1704     NA      1702     44      1427     27      1724     NA
1696     NA      1711     13      1444     14      1591     NA
1987     NA      1495     14      1587     79      1713     NA
1557     13      1640     57      1709     71      1625     NA
1378     71


Source: M. Hills and the M345 Course Team. M345 Statistical Methods, 
Units 3: Examining Straight-line Data, 1986, Milton Keynes: © Open Uni- 
versity, 28. Data provided by Professor M.J.Gardner, Medical Research Coun- 
cil Environmental Epidemiology Research Unit, Southampton. 


is modeled by a simple linear regression model, and one wants to test a null
hypothesis of the form $H_0: \alpha = \alpha_0$ or $H_0: \beta = \beta_0$. Similar to
the one-sample t-test we will construct a test statistic for each of these null
hypotheses. With normally distributed errors, these test statistics have a
t-distribution under the null hypothesis. For this reason, for both null
hypotheses the test is called a t-test.


Fig. 27.3. Scatterplot mortality data (mortality rate in % against calcium
concentration in ppm).




The t-test for the slope 


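For the null hypothesis $H_0: \beta = \beta_0$, we use as test statistic

$$T_b = \frac{\hat{\beta} - \beta_0}{S_b}.$$

In this expression, $\hat{\beta}$ is the least squares estimator for $\beta$, and

$$S_b^2 = \frac{n}{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}\,\hat{\sigma}^2,$$

where $\hat{\sigma}^2$ is the estimator for $\sigma^2$ as introduced on page 332. It
can be shown that

$$\mathrm{Var}(\hat{\beta} - \beta_0) = \frac{n}{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}\,\sigma^2,$$

so that the random variable $S_b^2$ is an estimator for the variance of
$\hat{\beta} - \beta_0$. Hence, similar to the test statistic for the one-sample
t-test, the test statistic $T_b$ compares the estimator $\hat{\beta}$ with the value
$\beta_0$ and standardizes by dividing by an estimator for the standard deviation
of $\hat{\beta} - \beta_0$. Values of $T_b$ close to zero are in favor of the null
hypothesis $H_0: \beta = \beta_0$. Large positive values of $T_b$ suggest that
$\beta > \beta_0$, whereas large negative values of $T_b$ suggest that
$\beta < \beta_0$.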


Recall that in the case of normal random samples the one-sample t-test
statistic has a $t(n-1)$ distribution under the null hypothesis. For the same
reason, it is also a fact that in the case of normally distributed errors the
test statistic $T_b$ has a $t(n-2)$ distribution under the null hypothesis
$H_0: \beta = \beta_0$.


In our mortality example we want to test $H_0: \beta = 0$ against
$H_1: \beta < 0$. For the data we find $\hat{\beta} = -3.2261$ and $s_b = 0.4847$,
so that the value of $T_b$ is

$$t_b = \frac{-3.2261}{0.4847} = -6.656.$$

If we test at level $\alpha = 0.05$, then we must compare this value with the left
critical value $-t_{59,0.05}$. This value is not in Table B.2, but we have that

$$-1.676 = -t_{50,0.05} < -t_{59,0.05}.$$

This means that $t_b$ is much smaller than $-t_{59,0.05}$, so that we reject the
null hypothesis at level 0.05. How much evidence the value $t_b = -6.656$ bears
against the null hypothesis is expressed by the one-tailed p-value
$P(T_b \le -6.656)$. From Table B.2 we can only see that this probability is
smaller than 0.0005. By means of a statistical package we find
$P(T_b \le -6.656) = 5.2 \cdot 10^{-9}$. The data provide overwhelming evidence
against the null hypothesis. We conclude that higher concentrations of calcium
correspond to lower mortality rates.




QUICK EXERCISE 27.4 The data in Table 27.3 can be separated into measurements
for towns at least as far north as Derby and towns south of Derby. For
the data corresponding to 35 towns at least as far north as Derby, one finds
$\hat{\beta} = -1.9313$ and $s_b = 0.8479$. Test $H_0: \beta = 0$ against
$H_1: \beta < 0$ at level 0.01, i.e., compute the value of the test statistic and
report your conclusion about the null hypothesis.


The t-test for the intercept 
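We test the null hypothesis $H_0: \alpha = \alpha_0$ with test statistic

$$T_a = \frac{\hat{\alpha} - \alpha_0}{S_a}, \tag{27.1}$$

where $\hat{\alpha}$ is the least squares estimator for $\alpha$ and

$$S_a^2 = \frac{\sum_{i=1}^n x_i^2}{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}\,\hat{\sigma}^2,$$

with $\hat{\sigma}^2$ defined as before. The random variable $S_a^2$ is an estimator
for the variance

$$\mathrm{Var}(\hat{\alpha} - \alpha_0) = \frac{\sum_{i=1}^n x_i^2}{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}\,\sigma^2.$$

Again, we compare the estimator $\hat{\alpha}$ with the value $\alpha_0$ and
standardize by dividing by an estimator for the standard deviation of
$\hat{\alpha} - \alpha_0$. Values of $T_a$ close to zero are in favor of the null
hypothesis $H_0: \alpha = \alpha_0$. Large positive values of $T_a$ suggest that
$\alpha > \alpha_0$, whereas large negative values of $T_a$ suggest that
$\alpha < \alpha_0$. Like $T_b$, in the case of normal errors, the test statistic
$T_a$ has a $t(n-2)$ distribution under the null hypothesis
$H_0: \alpha = \alpha_0$.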
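To make the two regression test statistics concrete, here is a small Python
sketch using only the standard library. It rests on two assumptions that are
not spelled out in this section: the estimator $\hat{\sigma}^2$ from page 332 is
taken to be the residual sum of squares divided by $n - 2$, and the six data
points are made up purely for illustration. The function name
`regression_t_statistics` is ours as well.

```python
import math


def regression_t_statistics(xs, ys, alpha0=0.0, beta0=0.0):
    """Least squares fit plus the t statistics for intercept and slope.

    Returns (alpha_hat, beta_hat, t_a, t_b, degrees_of_freedom).
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_xx = sum(x * x for x in xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sum_xx - sum_x ** 2

    beta_hat = (n * sum_xy - sum_x * sum_y) / denom
    alpha_hat = (sum_y - beta_hat * sum_x) / n

    # Assumed form of the variance estimator: residual sum of squares / (n - 2).
    rss = sum((y - alpha_hat - beta_hat * x) ** 2 for x, y in zip(xs, ys))
    sigma2_hat = rss / (n - 2)

    s_b = math.sqrt(n * sigma2_hat / denom)       # estimates sd of beta_hat
    s_a = math.sqrt(sum_xx * sigma2_hat / denom)  # estimates sd of alpha_hat
    t_b = (beta_hat - beta0) / s_b
    t_a = (alpha_hat - alpha0) / s_a
    return alpha_hat, beta_hat, t_a, t_b, n - 2


# Hypothetical data, not taken from the book.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

alpha_hat, beta_hat, t_a, t_b, df = regression_t_statistics(xs, ys)
print(f"alpha_hat = {alpha_hat:.4f}, beta_hat = {beta_hat:.4f}")
print(f"t_a = {t_a:.3f}, t_b = {t_b:.3f}, degrees of freedom = {df}")
```

Both statistics would then be compared with critical values of the $t(n-2)$
distribution, here with $n - 2 = 4$ degrees of freedom.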
As an illustration, recall Exercise 17.9 where we modeled the volume $y$ of
black cherry trees by means of a linear model without intercept, with
independent variable x = dh, where d and h are the diameter and height of the
trees. The scatterplot of the pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_{31}, y_{31})$
is displayed in Figure 27.4. As mentioned in Exercise 17.9, there are physical
reasons to leave out the intercept. We want to investigate whether this is
confirmed by the data. To this end, we model the data by a simple linear
regression model with intercept

$$Y_i = \alpha + \beta x_i + U_i \quad\text{for } i = 1, 2, \ldots, 31,$$

where $U_1, U_2, \ldots, U_{31}$ are a random sample from an $N(0, \sigma^2)$
distribution, and we test $H_0: \alpha = 0$ against $H_1: \alpha \neq 0$ at level
0.10. The value of the test statistic is

$$t_a = \frac{-0.2977}{s_a} = -0.3089,$$

and the left critical value is $-t_{29,0.05} = -1.699$. This means we cannot reject
the null hypothesis. The data do not provide sufficient evidence against
$H_0: \alpha = 0$, which is confirmed by the one-tailed p-value
$P(T_a \le -0.3089) = 0.3798$ (computed by means of a statistical package). We
conclude that the intercept does not contribute significantly to the model.


Fig. 27.4. Scatterplot of the black cherry tree data.


27.4 Solutions to the quick exercises 


27.1 If $Y$ has an $N(1,1)$ distribution, then $Y - 1$ has an $N(0,1)$
distribution. Therefore, from Table B.1: $P(Y - 1 \ge 0.03) = 0.4880$. If $Y$ has
an $N(1, (0.01)^2)$ distribution, then $(Y - 1)/0.01$ has an $N(0,1)$ distribution.
In that case,

$$P(Y - 1 \ge 0.03) = P\left(\frac{Y - 1}{0.01} \ge 3\right) = 0.0013.$$


27.2 For the first and last ten measurements the values of the test statistic
are

$$t = \frac{1.0194 - 1}{0.0290/\sqrt{10}} = 2.115 \quad\text{and}\quad t = \frac{1.0406 - 1}{0.0428/\sqrt{10}} = 3.000.$$

The critical value $t_{9,0.025} = 2.262$, which means we reject the null hypothesis
for the second production line, but not for the first production line.


27.3 The critical region is of the form $K = (-\infty, c_l] \cup [c_u, \infty)$. The
right critical value $c_u$ is approximated by $z_{0.0005} = t_{\infty,0.0005} = 3.291$,
which can be found in Table B.2. By symmetry of the normal distribution, the left
critical value $c_l$ is approximated by $-z_{0.0005} = -3.291$. Clearly,
$t = -7.39 < -3.291$, so that it falls in $K$.


27.4 The value of the test statistic is

$$t_b = \frac{-1.9313}{0.8479} = -2.278.$$

The left critical value is equal to $-t_{33,0.01}$, which is not in Table B.2, but
we see that $-t_{33,0.01} < -t_{40,0.01} = -2.423$. This means that
$-t_{33,0.01} < t_b$, so that we cannot reject $H_0: \beta = 0$ against
$H_1: \beta < 0$ at level 0.01.




27.5 Exercises 


27.1 We perform a t-test for the null hypothesis $H_0: \mu = 10$ by means of
a dataset consisting of n = 16 elements with sample mean 11 and sample
variance 4. We use significance level 0.05.


a. Should we reject the null hypothesis in favor of $H_1: \mu \neq 10$?
b. What if we test against $H_1: \mu > 10$?


27.2 The Cleveland Casting Plant is a large highly automated producer of
gray and nodular iron automotive castings for Ford Motor Company. One 
process variable of interest to Cleveland Casting is the pouring tempera- 
ture of molten iron. The pouring temperatures (in degrees Fahrenheit) of ten 
crankshafts are given in Table 27.4. The target setting for the pouring tem- 
perature is set at 2550 degrees. One wants to conduct a test at level a = 0.01 
to determine whether the pouring temperature differs from the target setting. 


Table 27.4. Pouring temperatures of ten crankshafts. 


2543 2541 2544 2620 2560 
2559 2562 2553 2552 2553 


© 1995 From A structural model relating process inputs and final product
characteristics, Quality Engineering, Vol 7, No. 4, pp. 693-704, by
Price, B. and Barth, B. Reproduced by permission of Taylor & Francis, Inc.,
http://www.taylorandfrancis.com


a. Formulate the appropriate null hypothesis and alternative hypothesis. 

b. Compute the value of the test statistic and report your conclusion. You 
may assume a normal model distribution and use that the sample variance 
is 517.34. 


27.3 Table 27.5 lists the results of tensile adhesion tests on 22 U-700 alloy 
specimens. The data are loads at failure in MPa. The sample mean is 13.71 
and the sample standard deviation is 3.55. You may assume that the data
originated from a normal distribution with expectation $\mu$. One is interested
in whether the load at failure exceeds 10 MPa. We investigate this by means
of a t-test for the null hypothesis $H_0: \mu = 10$.


a. What do you choose as the alternative hypothesis? 
b. Compute the value of the test statistic and report your conclusion, when 
performing the test at level 0.05. 




Table 27.5. Loads at failure of U-700 specimens. 


19.8 18.5 17.6 16.7 15.8
15.4 14.1 13.6 11.9 11.4
11.4  8.8  7.5 15.4 15.4
19.5 14.9 12.7 11.9 11.4
10.1  7.9


Source: C.C. Berndt. Instrumented Tensile adhesion tests on plasma sprayed
thermal barrier coatings. Journal of Materials Engineering 11(4): 275-282,
Dec 1989. © Springer-Verlag New York Inc.


27.4 Consider the coal data from Table 23.2, where 22 gross calorific value
measurements are listed for Daw Mill coal coded 258GB41. We modeled this
dataset as a realization of a random sample from an $N(\mu, \sigma^2)$ distribution
with $\mu$ and $\sigma$ unknown. We are planning to buy a shipment if the gross
calorific value exceeds 31.00 MJ/kg. The sample mean and sample standard
deviation of the data are $\bar{x}_n = 31.012$ and $s_n = 0.1294$. Perform a t-test
for the null hypothesis $H_0: \mu = 31.00$ against $H_1: \mu > 31.00$ using
significance level 0.01, i.e., compute the value of the test statistic, the
critical value of the test, and report your conclusion.


27.5 In the November 1988 issue of Science a study was reported on the 
inbreeding of tropical swarm-founding wasps. Each member of a sample of 
197 wasps was captured, frozen, and subjected to a series of genetic tests, 
from which an inbreeding coefficient was determined. The sample mean and
the sample standard deviation of the coefficients are $\bar{x}_{197} = 0.044$ and
$s_{197} = 0.884$. If a species does not have the tendency to inbreed, their true
inbreeding
coefficient is 0. Determine by means of a test whether the inbreeding coefficient 
for this species of wasp exceeds 0. 


a. Formulate the appropriate null hypothesis and alternative hypothesis and 
compute the value of the test statistic. 


b. Compute the p-value corresponding to the value of the test statistic and 
report your conclusion about the null hypothesis. 


27.6 The stopping distance of an automobile is related to its speed. The data 
in Table 27.6 give the stopping distance in feet and speed in miles per hour 
of an automobile. The data are modeled by means of a simple linear regression
model with normally distributed errors, with the square root of the stopping
distance as dependent variable $y$ and the speed as independent variable $x$:

$$Y_i = \alpha + \beta x_i + U_i, \quad\text{for } i = 1, \ldots, 7.$$

For the dataset we find

$$\hat{\alpha} = 5.388, \quad \hat{\beta} = 4.252, \quad s_a = 1.874, \quad s_b = 0.242.$$




Table 27.6. Speed and stopping distance of automobiles. 


Speed 20.5 20.5 30.5 30.5 40.5 48.8 57.8 
Distance 15.4 13.3 33.9 27.0 73.1 113.0 142.6 


Source: K.A. Brownlee. Statistical theory and methodology in science and 
engineering. Wiley, New York, 1960; Table II.9 on page 372. 


One would expect that the intercept can be taken equal to 0, since zero speed 
would yield zero stopping distance. Investigate whether this is confirmed by 
the data by performing the appropriate test at level 0.10. Formulate the proper 
null and alternative hypothesis, compute the value of the test statistic, and 
report your conclusion. 


27.7 In a study about the effect of wall insulation, the weekly gas
consumption (in 1000 cubic feet) and the average outside temperature (in
degrees Celsius) were measured for a certain house in southeast England, for 26
weeks before and 30 weeks after cavity-wall insulation had been installed.
The house thermostat was set at 20 degrees throughout. The data are listed 
in Table 27.7. We model the data before insulation by means of a simple lin- 
ear regression model with normally distributed errors and gas consumption 
as response variable. A similar model was used for the data after insulation. 
Given are 


Before insulation: $\hat{\alpha} = 6.8538$, $\hat{\beta} = -0.3932$ and $s_a = 0.1184$, $s_b = 0.0196$
After insulation: $\hat{\alpha} = 4.7238$, $\hat{\beta} = -0.2779$ and $s_a = 0.1297$, $s_b = 0.0252$.


a. Use the data before insulation to investigate whether lower outside
temperatures lead to higher gas consumption. Formulate the proper null and
alternative hypothesis, compute the value of the test statistic, and report
your conclusion, using significance level 0.05.


b. Do the same for the data after insulation. 




Table 27.7. Temperature and gas consumption. 


      Before insulation                  After insulation
Temperature  Gas consumption     Temperature  Gas consumption
   -0.8           7.2               -0.7           4.8
   -0.7           6.9                0.8           4.6
    0.4           6.4                1.0           4.7
    2.5           6.0                1.4           4.0
    2.9           5.8                1.5           4.2
    3.2           5.8                1.6           4.2
    3.6           5.6                2.3           4.1
    3.9           4.7                2.5           4.0
    4.2           5.8                2.5           3.5
    4.3           5.2                3.1           3.2
    5.4           4.9                3.9           3.9
    6.0           4.9                4.0           3.5
    6.0           4.3                4.0           3.7
    6.0           4.4                4.2           3.5
    6.2           4.5                4.3           3.5
    6.3           4.6                4.6           3.7
    6.9           3.7                4.7           3.5
    7.0           3.9                4.9           3.4
    7.4           4.2                4.9           3.7
    7.5           4.0                4.9           4.0
    7.5           3.9                5.0           3.6
    7.6           3.5                5.3           3.7
    8.0           4.0                6.2           2.8
    8.5           3.6                7.1           3.0
    9.1           3.1                7.2           2.8
   10.2           2.6                7.5           2.6
                                     8.0           2.7
                                     8.7           2.8
                                     8.8           1.3
                                     9.7           1.5


Source: MDST242 Statistics in Society, Unit 45: Review, 2nd edition, 1984, 
Milton Keynes: © The Open University, Figures 2.5 and 2.6. 


28 


Comparing two samples 


Many applications are concerned with two groups of observations of the same 
kind that originate from two possibly different model distributions, and the 
question is whether these distributions have different expectations. We de- 
scribe a test for equality of expectations, where we consider normal and non- 
normal model distributions and equal and unequal variances of the model 
distributions. 


28.1 Is dry drilling faster than wet drilling? 


Recall the drilling example from Sections 15.5 and 16.4. The question was 
whether dry drilling is faster than wet drilling. The scatterplots in Figure 15.11 
seem to suggest that up to a depth of 250 feet the drill time does not depend 
on depth. Therefore, for a first investigation of a possible difference between 
dry and wet drilling we only consider the (mean) drill times up to this depth. 
A more thorough study can be found in [23]. 


The boxplots of the drill times for both types of drilling are displayed in
Figure 28.1. Clearly, the boxplot for dry drilling is positioned lower than the
one for wet drilling. However, the question is whether this difference can be
attributed to chance or if it is large enough to conclude that the dry drill
time is shorter than the wet drill time. To answer this question, we model the
datasets of dry and wet drill times as realizations of random samples from
two distribution functions $F$ and $G$, one with expected value $\mu_1$ and the
other with expected value $\mu_2$. The parameters $\mu_1$ and $\mu_2$ represent the
drill times of dry drilling and wet drilling, respectively. We test
$H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 < \mu_2$.


Fig. 28.1. Boxplot of drill times.

This example illustrates a general situation where we compare two datasets

$$x_1, x_2, \ldots, x_n \quad\text{and}\quad y_1, y_2, \ldots, y_m,$$

which are the realization of independent random samples

$$X_1, X_2, \ldots, X_n \quad\text{and}\quad Y_1, Y_2, \ldots, Y_m$$

from two distributions, and we want to test whether the expectations of both
distributions are the same. Both the variance $\sigma_X^2$ of the $X_i$ and the
variance $\sigma_Y^2$ of the $Y_j$ are unknown.


Note that the null hypothesis is equivalent to the statement $\mu_1 - \mu_2 = 0$.
For this reason, similar to Chapter 27, the test statistic for the null
hypothesis $H_0: \mu_1 = \mu_2$ is based on an estimator $\bar{X}_n - \bar{Y}_m$ for the
difference $\mu_1 - \mu_2$. As before, we standardize $\bar{X}_n - \bar{Y}_m$ by an
estimator for its variance

$$\mathrm{Var}(\bar{X}_n - \bar{Y}_m) = \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}.$$

Recall that the sample variances $S_X^2$ and $S_Y^2$ of the $X_i$ and $Y_j$ are
unbiased estimators for $\sigma_X^2$ and $\sigma_Y^2$. We will use a combination of
$S_X^2$ and $S_Y^2$ to construct an estimator for
$\mathrm{Var}(\bar{X}_n - \bar{Y}_m)$. The actual standardization of
$\bar{X}_n - \bar{Y}_m$ depends on whether the variances of the $X_i$ and $Y_j$ are
the same. We distinguish between the two cases $\sigma_X^2 = \sigma_Y^2$ and
$\sigma_X^2 \neq \sigma_Y^2$. In the next section we consider the case of equal
variances.


QUICK EXERCISE 28.1 Looking at the boxplots in Figure 28.1, does the
assumption $\sigma_X^2 = \sigma_Y^2$ seem reasonable to you? Can you think of a way
to quantify your belief?


28.2 Two samples with equal variances 


Suppose that the samples originate from distributions with the same (but
unknown) variance:

$$\sigma_X^2 = \sigma_Y^2 = \sigma^2.$$

In this case we can pool the sample variances $S_X^2$ and $S_Y^2$ by constructing
a linear combination $aS_X^2 + bS_Y^2$ that is an unbiased estimator for
$\sigma^2$. One particular choice is the weighted average

$$\frac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}.$$

It has the property that for normally distributed samples it has the smallest
variance among all unbiased linear combinations of $S_X^2$ and $S_Y^2$ (see
Exercise 28.5). Moreover, the weights depend on the sample sizes. This is
appropriate, since if one sample is much larger than the other, the estimate of
$\sigma^2$ from that sample is more reliable and should receive greater weight.


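We find that the pooled variance

$$S_p^2 = \frac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}\left(\frac{1}{n} + \frac{1}{m}\right)$$

is an unbiased estimator for

$$\mathrm{Var}(\bar{X}_n - \bar{Y}_m) = \sigma^2\left(\frac{1}{n} + \frac{1}{m}\right).$$

This leads to the following test statistic for the null hypothesis
$H_0: \mu_1 = \mu_2$:

$$T_p = \frac{\bar{X}_n - \bar{Y}_m}{S_p}.$$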


As before, we compare the estimator $\bar{X}_n - \bar{Y}_m$ with 0 (the value of
$\mu_1 - \mu_2$ under the null hypothesis), and we standardize by dividing by the
estimator $S_p$ for the standard deviation of $\bar{X}_n - \bar{Y}_m$. Values of
$T_p$ close to zero are in favor of the null hypothesis $H_0: \mu_1 = \mu_2$. Large
positive values of $T_p$ suggest that $\mu_1 > \mu_2$, whereas large negative
values suggest that $\mu_1 < \mu_2$.
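The unbiasedness claimed for the pooled variance can be checked in one line.
The following short derivation is a sketch that uses only the facts stated
above, namely that $S_X^2$ and $S_Y^2$ are unbiased for the common variance
$\sigma^2$:

```latex
\begin{align*}
\mathrm{E}\left[S_p^2\right]
  &= \frac{(n-1)\,\mathrm{E}\left[S_X^2\right] + (m-1)\,\mathrm{E}\left[S_Y^2\right]}{n+m-2}
     \left(\frac{1}{n} + \frac{1}{m}\right) \\
  &= \frac{(n-1)\,\sigma^2 + (m-1)\,\sigma^2}{n+m-2}
     \left(\frac{1}{n} + \frac{1}{m}\right)
   = \sigma^2\left(\frac{1}{n} + \frac{1}{m}\right)
   = \mathrm{Var}\left(\bar{X}_n - \bar{Y}_m\right).
\end{align*}
```

The same computation shows that any weights $a$ and $b$ with $a + b = 1$ would
give an unbiased estimator for $\sigma^2$; the particular weights used here are
the ones singled out by the minimum-variance property mentioned earlier.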
The next step is to determine the distribution of $T_p$. Note that under the
null hypothesis $H_0: \mu_1 = \mu_2$, the test statistic $T_p$ is the pooled
studentized mean difference

$$\frac{(\bar{X}_n - \bar{Y}_m) - (\mu_1 - \mu_2)}{S_p}.$$

Hence, under the null hypothesis, the probability distribution of $T_p$ is the
same as that of the pooled studentized mean difference. To determine its
distribution, we distinguish between normal and nonnormal data.


Normal samples 


In the same way as the studentized mean of a single normal sample has a
$t(n-1)$ distribution (see page 349), it is also a fact that if two independent
samples originate from normal distributions, i.e.,

$$X_1, X_2, \ldots, X_n \text{ random sample from } N(\mu_1, \sigma^2)$$
$$Y_1, Y_2, \ldots, Y_m \text{ random sample from } N(\mu_2, \sigma^2),$$

then the pooled studentized mean difference has a $t(n+m-2)$ distribution.
Hence, under the null hypothesis, the test statistic $T_p$ has a $t(n+m-2)$
distribution. For this reason, a test for the null hypothesis
$H_0: \mu_1 = \mu_2$ is called a two-sample t-test.


Suppose that in our drilling example we model our datasets as realizations
of random samples of sizes n = m = 50 from two normal distributions with
equal variances, and we test $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 < \mu_2$ at
level 0.05. For the data we find $\bar{x}_{50} = 727.78$, $\bar{y}_{50} = 873.02$, and
$s_p = 13.62$, so that

$$t_p = \frac{727.78 - 873.02}{13.62} = -10.66.$$

We compare this with the left critical value $-t_{98,0.05}$. This value is not in
Table B.2, but $-1.676 = -t_{50,0.05} < -t_{98,0.05}$. This means that
$t_p < -t_{98,0.05}$, so that we reject $H_0: \mu_1 = \mu_2$ in favor of
$H_1: \mu_1 < \mu_2$ at level 0.05. The p-value corresponding to $t_p = -10.66$ is
the left tail probability $P(T \le -10.66)$. From Table B.2 we can only see that
this is smaller than 0.0005 (a statistical software package gives
$P(T \le -10.66) = 2.25 \cdot 10^{-18}$). The data provide overwhelming evidence
against the null hypothesis, so that we conclude that dry drilling is faster
than wet drilling.


QUICK EXERCISE 28.2 Suppose that in the ball bearing example of Quick
exercise 27.2, we test $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \neq \mu_2$, where
$\mu_1$ and $\mu_2$ represent the diameters of a ball bearing from the first and
second production line. What are the critical values corresponding to level
$\alpha = 0.01$?


Nonnormal samples 


Similar to the one-sample t-test, if we cannot assume normal model distribu- 
tions, then we can no longer conclude that our test statistic has a t(n + m — 2) 
distribution under the null hypothesis. Recall that under the null hypothesis, 
the distribution of our test statistic is the same as that of the pooled studen- 
tized mean difference (see page 417). 


To approximate its distribution, we use the empirical bootstrap simulation
for the pooled studentized mean difference

$$\frac{(\bar{X}_n - \bar{Y}_m) - (\mu_1 - \mu_2)}{S_p}.$$

Given datasets $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_m$, determine their
empirical distribution functions $F_n$ and $G_m$ as estimates for $F$ and $G$. The
expectations corresponding to $F_n$ and $G_m$ are $\mu_1^* = \bar{x}_n$ and
$\mu_2^* = \bar{y}_m$. Then repeat the following two steps many times:


1. Generate a bootstrap dataset $x_1^*, x_2^*, \ldots, x_n^*$ from $F_n$ and a
bootstrap dataset $y_1^*, y_2^*, \ldots, y_m^*$ from $G_m$.


2. Compute the pooled studentized mean difference for the bootstrap data:

$$t_p^* = \frac{(\bar{x}_n^* - \bar{y}_m^*) - (\bar{x}_n - \bar{y}_m)}{s_p^*},$$

where $\bar{x}_n^*$ and $\bar{y}_m^*$ are the sample means of the bootstrap
datasets, and

$$(s_p^*)^2 = \frac{(n-1)(s_X^*)^2 + (m-1)(s_Y^*)^2}{n+m-2}\left(\frac{1}{n} + \frac{1}{m}\right)$$

with $(s_X^*)^2$ and $(s_Y^*)^2$ the sample variances of the bootstrap datasets.


The reason that in each iteration we subtract $\bar{x}_n - \bar{y}_m$ is that
$\mu_1 - \mu_2$ is the difference of the expectations of the two model
distributions. Therefore, according to the bootstrap principle we should replace
this by the difference $\bar{x}_n - \bar{y}_m$ of the expectations corresponding to
the two empirical distribution functions.


We carried out this bootstrap simulation for the drill times. The result of this
simulation can be seen in Figure 28.2, where a histogram and the empirical
distribution function are displayed for one thousand bootstrap values of
$t_p^*$. Suppose that we test $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 < \mu_2$ at
level 0.05. The bootstrap approximation for the left critical value is
$c_l^* = -1.659$. The value of $t_p = -10.66$, computed from the data, is much
smaller. Hence, also on the basis of the bootstrap simulation we reject the null
hypothesis and conclude that the dry drill time is shorter than the wet drill
time.
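The two-step bootstrap procedure for the pooled studentized mean difference
can be sketched as follows. This is an illustrative Python version using only
the standard library. The drill times themselves are not listed in this
chapter, so the two small datasets below are made up, and the function names,
the seed, and the order-statistic rule for the critical value are our own
choices.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def sample_var(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def pooled_sd(xs, ys):
    """Square root of the pooled variance S_p^2, including the (1/n + 1/m) factor."""
    n, m = len(xs), len(ys)
    pooled = ((n - 1) * sample_var(xs) + (m - 1) * sample_var(ys)) / (n + m - 2)
    return math.sqrt(pooled * (1 / n + 1 / m))


def bootstrap_pooled(xs, ys, reps, rng):
    """Sorted bootstrap values of the pooled studentized mean difference."""
    center = mean(xs) - mean(ys)   # replaces mu_1 - mu_2 (bootstrap principle)
    values = []
    while len(values) < reps:
        bx = rng.choices(xs, k=len(xs))   # step 1: resample from F_n
        by = rng.choices(ys, k=len(ys))   #         and from G_m
        sp = pooled_sd(bx, by)
        if sp > 0:                        # skip degenerate resamples
            values.append(((mean(bx) - mean(by)) - center) / sp)  # step 2
    return sorted(values)


# Hypothetical measurements, not the drill times from the book.
xs = [7.1, 6.8, 7.4, 6.9, 7.3, 7.0, 6.6, 7.2]
ys = [7.9, 8.3, 7.6, 8.1, 8.4, 7.8, 8.0]

t_p = (mean(xs) - mean(ys)) / pooled_sd(xs, ys)
t_star = bootstrap_pooled(xs, ys, 2000, random.Random(0))
c_left = t_star[int(0.05 * len(t_star))]   # approximate left critical value
reject = t_p <= c_left                     # one-sided test against mu_1 < mu_2

print(f"t_p = {t_p:.3f}, left critical value = {c_left:.3f}, reject H0: {reject}")
```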


Fig. 28.2. Histogram and empirical distribution function of 1000 bootstrap values
for $T_p$.


28.3 Two samples with unequal variances 


During an investigation about weather modification, a series of experiments
was conducted in southern Florida from 1968 to 1972. These experiments
were designed to investigate the use of massive silver-iodide seeding. It was
hypothesized that under specified conditions, this leads to invigorated cumulus
growth and prolonged lifetimes, thereby causing increased precipitation. In
these experiments, 52 isolated cumulus clouds were observed, of which 26 were
selected at random and injected with silver-iodide smoke. Rainfall amounts
(in acre-feet) were recorded for all clouds. They are listed in Table 28.1. To
investigate whether seeding leads to increased rainfall, we test
$H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 < \mu_2$, where $\mu_1$ and $\mu_2$
represent the rainfall for unseeded and seeded clouds.


Table 28.1. Rainfall data.


Unseeded


1202.6  830.1  372.4  345.5  321.2  244.3
 163.0  147.8   95.0   87.0   81.2   68.5
  47.3   41.1   36.6   29.0   28.6   26.3
  26.1   24.4   21.7   17.3   11.5    4.9
   4.9    1.0


Seeded


2745.6 1697.8 1656.0  978.0  703.4  489.1
 430.0  334.1  302.8  274.7  274.7  255.0
 242.5  200.7  198.6  129.6  119.0  118.3
 115.3   92.4   40.6   32.7   31.4   17.5
   7.7    4.1


Source: J. Simpson, A. Olsen, and J.C. Eden. A Bayesian analysis of a
multiplicative treatment effect in weather modification. Technometrics,
17:161-166, 1975; Table 1 on page 162.


In Figure 28.3 the boxplots of both datasets are displayed. From this we
see that the assumption of equal variances may not be realistic. Indeed, this
is confirmed by the values $s_X^2 = 77521$ and $s_Y^2 = 423524$ of the sample
variances of the datasets. This means that we need to test $H_0: \mu_1 = \mu_2$
without the assumption of equal variances. As before, the test statistic will be
a standardized version of $\bar{X}_n - \bar{Y}_m$, but $S_p^2$ is no longer an unbiased estimator
for

$$\mathrm{Var}(\bar{X}_n - \bar{Y}_m) = \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}.$$

However, if we estimate $\sigma_X^2$ and $\sigma_Y^2$ by $S_X^2$ and $S_Y^2$, then the nonpooled variance

$$S_d^2 = \frac{S_X^2}{n} + \frac{S_Y^2}{m}$$

is an unbiased estimator for $\mathrm{Var}(\bar{X}_n - \bar{Y}_m)$. This leads to test statistic

$$T_d = \frac{\bar{X}_n - \bar{Y}_m}{S_d}.$$




Fig. 28.3. Boxplots of rainfall (unseeded and seeded clouds).


Again, we compare the estimator $\bar{X}_n - \bar{Y}_m$ with zero and standardize by
dividing by an estimator for the standard deviation of $\bar{X}_n - \bar{Y}_m$. Values of $T_d$
close to zero are in favor of the null hypothesis $H_0: \mu_1 = \mu_2$.
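The unbiasedness of the nonpooled variance can be checked in two lines. The sketch below assumes only that the two samples are independent and that the sample variances are unbiased, that is, $\mathrm{E}[S_X^2] = \sigma_X^2$ and $\mathrm{E}[S_Y^2] = \sigma_Y^2$:

```latex
\begin{align*}
\mathrm{Var}(\bar{X}_n - \bar{Y}_m)
  &= \mathrm{Var}(\bar{X}_n) + \mathrm{Var}(\bar{Y}_m)
   = \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}, \\
\mathrm{E}\bigl[S_d^2\bigr]
  &= \frac{\mathrm{E}[S_X^2]}{n} + \frac{\mathrm{E}[S_Y^2]}{m}
   = \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}
   = \mathrm{Var}(\bar{X}_n - \bar{Y}_m).
\end{align*}
```

The first line uses independence of the samples; the second uses linearity of expectations.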


QUICK EXERCISE 28.3 Consider the ball bearing example from Quick exercise 27.2.
Compute the value of $T_d$ for this example.


Under the null hypothesis $H_0: \mu_1 = \mu_2$, the test statistic

$$T_d = \frac{\bar{X}_n - \bar{Y}_m}{S_d}$$

is equal to the nonpooled studentized mean difference

$$\frac{(\bar{X}_n - \bar{Y}_m) - (\mu_1 - \mu_2)}{S_d}.$$

Therefore, the distribution of $T_d$ under the null hypothesis is the same as that
of the nonpooled studentized mean difference. Unfortunately, its distribution
is not a t-distribution, not even in the case of normal samples. This means
that we have to approximate this distribution.


Similar to the previous section, we use the empirical bootstrap simulation for
the nonpooled studentized mean difference. The only difference with the procedure
outlined in the previous section is that now in each iteration we compute
the nonpooled studentized mean difference for the bootstrap datasets:

$$t_d^* = \frac{(\bar{x}_n^* - \bar{y}_m^*) - (\bar{x}_n - \bar{y}_m)}{s_d^*},$$

where $\bar{x}_n^*$ and $\bar{y}_m^*$ are the sample means of the bootstrap datasets, and

$$(s_d^*)^2 = \frac{(s_X^*)^2}{n} + \frac{(s_Y^*)^2}{m},$$

with $(s_X^*)^2$ and $(s_Y^*)^2$ the sample variances of the bootstrap datasets.

Fig. 28.4. Histogram and empirical distribution function of 1000 bootstrap values
of $T_d^*$.

We carried out this bootstrap simulation for the cloud seeding data. The
result of this simulation can be seen in Figure 28.4, where a histogram and
the empirical distribution function are displayed for one thousand values $t_d^*$.
The bootstrap approximation for the left critical value corresponding to level
0.05 is $c_l^* = -1.405$. For the data we find the value

$$t_d = \frac{164.59 - 441.98}{138.82} = -1.998.$$

This is smaller than $c_l^*$, so we reject the null hypothesis. Although the evidence
against the null hypothesis is not overwhelming, there is some indication that
seeding clouds leads to more rainfall.
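The bootstrap procedure for the nonpooled studentized mean difference can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulation used for Figure 28.4: the function names, the seed, and the choice of the 50th smallest of 1000 values as the 0.05 quantile are all assumptions made here, so the critical value it produces will differ somewhat from $-1.405$.

```python
import random
import statistics

# Rainfall amounts (acre-feet) from Table 28.1.
unseeded = [1202.6, 830.1, 372.4, 345.5, 321.2, 244.3, 163.0, 147.8, 95.0,
            87.0, 81.2, 68.5, 47.3, 41.1, 36.6, 29.0, 28.6, 26.3, 26.1,
            24.4, 21.7, 17.3, 11.5, 4.9, 4.9, 1.0]
seeded = [2745.6, 1697.8, 1656.0, 978.0, 703.4, 489.1, 430.0, 334.1, 302.8,
          274.7, 274.7, 255.0, 242.5, 200.7, 198.6, 129.6, 119.0, 118.3,
          115.3, 92.4, 40.6, 32.7, 31.4, 17.5, 7.7, 4.1]


def nonpooled_sd(x, y):
    """Square root of s_X^2/n + s_Y^2/m (the nonpooled variance)."""
    return (statistics.variance(x) / len(x)
            + statistics.variance(y) / len(y)) ** 0.5


def bootstrap_left_critical_value(x, y, level=0.05, reps=1000, seed=1):
    """Empirical bootstrap for the nonpooled studentized mean difference.

    Returns an approximate left critical value: the (level * reps)-th
    smallest bootstrap value (one possible quantile convention).
    """
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    observed_diff = statistics.mean(x) - statistics.mean(y)
    values = []
    for _ in range(reps):
        xs = rng.choices(x, k=len(x))  # resample with replacement
        ys = rng.choices(y, k=len(y))
        centered = statistics.mean(xs) - statistics.mean(ys) - observed_diff
        values.append(centered / nonpooled_sd(xs, ys))
    values.sort()
    return values[int(level * reps) - 1]


# Observed value of the test statistic; approximately -1.998.
t_d = (statistics.mean(unseeded) - statistics.mean(seeded)) / nonpooled_sd(unseeded, seeded)
c_left = bootstrap_left_critical_value(unseeded, seeded)
print(f"t_d = {t_d:.3f}, bootstrap left critical value = {c_left:.3f}")
print("reject H0" if t_d < c_left else "do not reject H0")
```

Because the resampling is random, a different seed or number of repetitions gives a slightly different critical value; the decision rule itself is the one described above.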


28.4 Large samples 


Variants of the central limit theorem state that as n and m both tend to
infinity, the distributions of the pooled studentized mean difference

$$\frac{(\bar{X}_n - \bar{Y}_m) - (\mu_1 - \mu_2)}{S_p}$$

and the nonpooled studentized mean difference

$$\frac{(\bar{X}_n - \bar{Y}_m) - (\mu_1 - \mu_2)}{S_d}$$

both approach the standard normal distribution. This fact can be used to
approximate the distribution of the test statistics $T_p$ and $T_d$ under the null
hypothesis by a standard normal distribution.




We illustrate this by means of the following example. To investigate whether a
restricted diet promotes longevity, two groups of randomly selected rats were
put on the different diets. One group of n = 106 rats was put on a restricted
diet, the other group of m = 89 rats on an ad libitum diet (i.e., unrestricted
eating). The data in Table 28.2 represent the remaining lifetime in days of two
groups of rats after they were put on the different diets. The average lifetimes
are $\bar{x}_n = 968.75$ and $\bar{y}_m = 684.01$ days. To investigate whether a restricted
diet promotes longevity, we test $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 > \mu_2$, where
$\mu_1$ and $\mu_2$ represent the lifetime of a rat on a restricted diet and on an ad
libitum diet, respectively.


If we may assume equal variances, we compute

$$t_p = \frac{968.75 - 684.01}{32.88} = 8.66.$$

This value is larger than the right critical value $z_{0.0005} = 3.291$, which means
that we would reject $H_0: \mu_1 = \mu_2$ in favor of $H_1: \mu_1 > \mu_2$ at level $\alpha = 0.0005$.


Table 28.2. Rat data. 


Restricted 


105 193 211 236 302 363 389 390 391 403
530 604 605 630 716 718 727 731 749 769
770 789 804 810 811 833 868 871 875 893
897 901 906 907 919 923 931 940 957 958 
961 962 974 979 982 1001 1008 1010 1011 1012 
1014 1017 1032 1039 1045 1046 1047 1057 1063 1070 
1073 1076 1085 1090 1094 1099 1107 1119 1120 1128 
1129 1131 1133 1136 1138 1144 1149 1160 1166 1170
1173 1181 1183 1188 1190 1203 1206 1209 1218 1220 
1221 1228 1230 1231 1233 1239 1244 1258 1268 1294 
1316 1327 1328 1369 1393 1435 


Ad libitum 


89 104 387 465 479 494 496 514 532 536 
545 547 548 582 606 609 619 620 621 630 
635 639 648 652 653 654 660 665 667 668 
670 675 677 678 678 681 684 688 694 695 
697 698 702 704 710 711 712 715 716 717 
720 721 730 731 732 733 735 736 738 739
741 743 746 749 751 753 764 765 768 770
773 777 779 780 788 791 794 796 799 801
806 807 815 836 838 850 859 894 963 


Source: B.L. Berger, D.D. Boos, and F.M. Guess. Tests and confidence sets 
for comparing two mean residual life functions. Biometrics, 44:103-115, 1988. 




The p-value is the right tail probability $P(T_p \geq 8.66)$, which we approximate
by $P(Z \geq 8.66)$, where Z has an N(0,1) distribution. From Table B.1 we see
that this probability is smaller than $P(Z \geq 3.49) = 0.0002$. By means of a
statistical package we find $P(Z \geq 8.66) = 2.4 \cdot 10^{-18}$.


If we repeat the test without the assumption of equal variances, we compute

$$t_d = \frac{968.75 - 684.01}{31.08} = 9.16,$$

which also leads to rejection of the null hypothesis. In this case, the p-value
$P(T_d \geq 9.16) \approx P(Z \geq 9.16)$ is even smaller since 9.16 > 8.66 (a statistical
package gives $P(Z \geq 9.16) = 2.6 \cdot 10^{-20}$). The data provide overwhelming
evidence against the null hypothesis, and we conclude that a restricted diet
promotes longevity.
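The two large-sample computations can be reproduced from the summary numbers quoted in the text. The sketch below is only an illustration: the function and variable names are assumptions made here, and the standard normal tail is obtained from the complementary error function in Python's standard library rather than from any particular statistical package.

```python
import math


def normal_right_tail(z):
    """P(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Summary values quoted in the text for the rat data.
mean_restricted = 968.75
mean_ad_libitum = 684.01
s_p = 32.88  # pooled estimate of the standard deviation of the difference
s_d = 31.08  # nonpooled estimate

t_p = (mean_restricted - mean_ad_libitum) / s_p
t_d = (mean_restricted - mean_ad_libitum) / s_d

print(f"t_p = {t_p:.2f}, approximate p-value = {normal_right_tail(8.66):.1e}")
print(f"t_d = {t_d:.2f}, approximate p-value = {normal_right_tail(9.16):.1e}")
```

Both tail probabilities are far below any conventional significance level, in line with the conclusion above.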


28.5 Solutions to the quick exercises 


28.1 Just by looking at the boxplots, the authors believe that the assumption
$\sigma_X^2 = \sigma_Y^2$ is reasonable. The lengths of the boxplots and their IQRs are almost
the same. However, the boxplots do not reveal how the elements of the dataset
vary around the center. One way of quantifying our belief would be to compare
the sample variances of the datasets. One possibility is to compare the ratio of
both sample variances; a ratio close to one would support our belief of equal
variances (in case of normal samples, this is a standard test called the F-test).


28.2 In this case we have a right and left critical value. From Quick exercise 27.2
we know that n = m = 10, so that the right critical value is
$t_{18,0.005} = 2.878$ and the left critical value is $-t_{18,0.005} = -2.878$.


28.3 We first compute $s_d^2 = (0.0290)^2/10 + (0.0428)^2/10 = 0.000267$ and then
$t_d = (1.0194 - 1.0406)/\sqrt{0.000267} = -1.297$.
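The arithmetic in the solution to Quick exercise 28.3 can be checked with a small helper that works from summary statistics only. The helper name and argument order are assumptions made for this sketch; the numbers are those quoted in the solution.

```python
import math


def nonpooled_t(mean_x, sd_x, n, mean_y, sd_y, m):
    """Nonpooled test statistic t_d computed from sample means and standard deviations."""
    s_d_squared = sd_x ** 2 / n + sd_y ** 2 / m
    return (mean_x - mean_y) / math.sqrt(s_d_squared)


# Ball bearing example: n = m = 10.
t_d_ball_bearing = nonpooled_t(1.0194, 0.0290, 10, 1.0406, 0.0428, 10)
print(f"t_d = {t_d_ball_bearing:.3f}")
```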


28.6 Exercises 


28.1 © The data in Table 28.3 represent salaries (in pounds Sterling) in 72 
randomly selected advertisements in The Guardian (April 6, 1992). When
a range was given in the advertisement, the midpoint of the range is repro- 
duced in the table. The data are salaries corresponding to two kinds of occu- 
pations (n = m = 72): (1) creative, media, and marketing and (2) education. 
The sample mean and sample variance of the two datasets are, respectively: 


(1) $\bar{x}_{72} = 17410$ and $s_x^2 = 41258741$,
(2) $\bar{y}_{72} = 19818$ and $s_y^2 = 50744521$.




Table 28.3. Salaries in two kinds of occupations. 


Occupation (1)          Occupation (2)

17703 13796 12000       25899 17378 19236
42000 22958 22900       21676 15594 18780
18780 10750 13440       15053 17375 12459
15723 13552 17574       19461 20111 22700
13179 21000 22149       22485 16799 35750
37500 18245 17547       17378 12587 20539
22955 19358  9500       15053 24102 13115
13000 22000 25000       10998 12755 13605
13500 12000 15723       18360 35000 20539
13000 16820 12300       22533 20500 16629
11000 17709 10750       23008 13000 27500
12500 23065 11000       24260 18066 17378
13000 18693 19000       25899 35403 15053
10500 14472 13500       18021 17378 20594
12285 12000 32000       17970 14855  9866
13000 20000 17783       21074 21074 21074
16000 18900 16600       15053 19401 25598
15000 14481 18000       20739 15053 15053
13944 35000 11406       15053 15083 31530
23960 18000 23000       30800 10294 16799
11389 30000 15379       37000 11389 15053
12587 12548 21458       48000 11389 14359
17000 17048 21262       16000 26544 15344
 9000 13349 20000       20147 14274 31000


Source: D.J. Hand, F. Daly, A.D. Lunn, K.J. McConway, and E. Ostrowski. 
Small data sets. Chapman and Hall, London, 1994; dataset 385. Data col- 
lected by D.J. Hand. 




Suppose that the datasets are modeled as realizations of normal distributions
with expectations $\mu_1$ and $\mu_2$, which represent the salaries for occupations (1)
and (2).


a. Test the null hypothesis that the salary for both occupations is the same
at level $\alpha = 0.05$ under the assumption of equal variances. Formulate
the proper null and alternative hypotheses, compute the value of the test
statistic, and report your conclusion.


b. Do the same without the assumption of equal variances.


c. As a comparison, one carries out an empirical bootstrap simulation for the
nonpooled studentized mean difference. The bootstrap approximations for
the critical values are $c_l^* = -2.004$ and $c_u^* = 2.133$. Report your conclusion
about the salaries on the basis of the bootstrap results.




28.2 The data in Table 28.4 represent the duration of pregnancy for 1669 
women who gave birth in a maternity hospital in Newcastle-upon-Tyne, Eng- 
land, in 1954. 


Table 28.4. Durations of pregnancy. 


Duration Medical Emergency Social 


11 1 

15 1 

17 1 

20 1 

22 1 2 

24 1 3 

25 2 1 
26 1 

27 2 2 1 
28 1 2 1 
29 3 1 

30 3 5 1 
31 4 5 2 
32 10 9 2 
33 6 6 2 
34 12 7 10 
35 23 11 4 
36 26 13 19 
37 54 16 30 
38 68 35 72 
39 159 38 115 
40 197 32 155 
41 111 27 128
42 55 25 64 
43 29 8 16 
44 4 5 3 
45 3 1 6 
46 1 1 1 
47 1

56 1 


Source: D.J. Newell. Statistical aspects of the demand for maternity beds. 
Journal of the Royal Statistical Society, Series A, 127:1—33, 1964. 


The durations are measured in complete weeks from the beginning of the last 
menstrual period until delivery. The pregnancies are divided into those where 
an admission was booked for medical reasons, those booked for social reasons 
(such as poor housing), and unbooked emergency admissions. For the three 
groups the sample means and sample variances are 




Medical: 775 observations with $\bar{x} = 39.08$ and $s^2 = 7.77$,
Emergency: 261 observations with $\bar{x} = 37.59$ and $s^2 = 25.33$,
Social: 633 observations with $\bar{x} = 39.60$ and $s^2 = 4.95$.


Suppose we view the datasets as realizations of random samples from normal
distributions with expectations $\mu_1$, $\mu_2$, and $\mu_3$ and variances $\sigma_1^2$, $\sigma_2^2$, and $\sigma_3^2$,
where $\mu_i$ represents the duration of pregnancy for the women from the ith
group. We want to investigate whether the duration differs for the different
groups. For each combination of two groups test the null hypothesis of equality
of the corresponding $\mu_i$. Compute the values of the test statistic and report your conclusions.


28.3 GE] In a seven-day study on the effect of ozone, a group of 23 rats was 
kept in an ozone-free environment and a group of 22 rats in an ozone-rich 
environment. From each member in both groups the increase in weight (in 
grams) was recorded. The results are given in Table 28.5. The interest is in 
whether ozone affects the increase of weight. We investigate this by testing 
$H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \neq \mu_2$, where $\mu_1$ and $\mu_2$ denote the increases of
weight for a rat in the ozone-free and ozone-rich groups. The sample means
are


Ozone-free: $\bar{x}_{23} = 22.40$
Ozone-rich: $\bar{y}_{22} = 11.01$.


The pooled standard deviation is $s_p = 4.58$, and the nonpooled standard
deviation is $s_d = 4.64$.


Table 28.5. Weight increase of rats. 


Ozone-free                Ozone-rich

 41.0  38.4  24.4          10.1    6.1  20.4
 25.9  21.9  18.3           7.3   14.3  15.5
 13.1  27.3  28.5          -9.9    6.8  28.2
-16.9  17.4  21.8          17.9  -12.9  14.0
 15.4  27.4  19.2           6.6   12.1  15.7
 22.4  17.7  26.0          39.9  -15.9  54.6
 29.4  21.4  22.7         -14.7   44.1  -9.0
 26.0  26.6                -9.0


Source: K.A. Doksum and G.L. Sievers. Plotting with confidence: graphical 
comparisons of two populations. Biometrika, 63(3):421—434, 1976; Table 10 
on page 433. By permission of the Biometrika Trustees. 


a. Perform the test at level 0.05 under the assumption of normal data with 
equal variances, i.e., compute the test statistic and report your conclusion. 

b. One also carries out a bootstrap simulation for the test statistic used in
a, and finds critical values $c_l^* = -1.912$ and $c_u^* = 1.959$. What is your
conclusion on the basis of the bootstrap simulation?




c. Also perform the test at level 0.05 without the assumption of equal vari- 
ances, where you may use the normal approximation for the distribution 
of the test statistic under the null hypothesis. 


d. A bootstrap simulation for the test statistic in c yields that the right tail 
probability corresponding to the observed value of the test statistic in 
this case is 0.014. What is your conclusion on the basis of the bootstrap 
simulation? 


28.4 Show that in the case when n = m, the random variables $T_p$ and $T_d$ are
the same.


28.5 ⊞ Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ be independent random samples
from normal distributions with variances $\sigma^2$. It can be shown that

$$\mathrm{Var}(S_X^2) = \frac{2\sigma^4}{n-1} \quad\text{and}\quad \mathrm{Var}(S_Y^2) = \frac{2\sigma^4}{m-1}.$$

Consider linear combinations $aS_X^2 + bS_Y^2$ that are unbiased estimators for $\sigma^2$.


a. Show that a and b must satisfy a + b = 1.


b. Show that $\mathrm{Var}(aS_X^2 + (1-a)S_Y^2)$ is minimized for $a = (n-1)/(n+m-2)$
(and hence $b = (m-1)/(n+m-2)$).


28.6 Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ be independent random samples
from distributions with (possibly unequal) variances $\sigma_X^2$ and $\sigma_Y^2$.


a. Show that

$$\mathrm{Var}(\bar{X}_n - \bar{Y}_m) = \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}.$$


b. Show that the pooled variance $S_p^2$, as defined on page 417, is a biased
estimator for $\mathrm{Var}(\bar{X}_n - \bar{Y}_m)$.


c. Show that the nonpooled variance $S_d^2$, as defined on page 420, is the only
unbiased estimator for $\mathrm{Var}(\bar{X}_n - \bar{Y}_m)$ of the form $aS_X^2 + bS_Y^2$.


d. Suppose that $\sigma_X^2 = \sigma_Y^2 = \sigma^2$. Show that $S_p^2$, as defined on page 417, is an
unbiased estimator for $\mathrm{Var}(\bar{X}_n - \bar{Y}_m) = \sigma^2(1/n + 1/m)$.


e. Is $S_p^2$ also an unbiased estimator for $\mathrm{Var}(\bar{X}_n - \bar{Y}_m)$ in the case $\sigma_X^2 \neq \sigma_Y^2$?
What about when n = m?


A 


Summary of distributions 


Discrete distributions 


1. Bernoulli distribution: Ber(p), where $0 \leq p \leq 1$.
$P(X = 1) = p$ and $P(X = 0) = 1 - p$.
$\mathrm{E}[X] = p$ and $\mathrm{Var}(X) = p(1-p)$.

2. Binomial distribution: Bin(n, p), where $0 \leq p \leq 1$.
$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$ for $k = 0, 1, \ldots, n$.
$\mathrm{E}[X] = np$ and $\mathrm{Var}(X) = np(1-p)$.

3. Geometric distribution: Geo(p), where $0 < p \leq 1$.
$P(X = k) = p(1-p)^{k-1}$ for $k = 1, 2, \ldots$.
$\mathrm{E}[X] = 1/p$ and $\mathrm{Var}(X) = (1-p)/p^2$.

4. Poisson distribution: Pois($\mu$), where $\mu > 0$.
$P(X = k) = \frac{\mu^k}{k!} e^{-\mu}$ for $k = 0, 1, \ldots$.
$\mathrm{E}[X] = \mu$ and $\mathrm{Var}(X) = \mu$.


Continuous distributions 


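1. Cauchy distribution: Cau($\alpha$, $\beta$), where $-\infty < \alpha < \infty$ and $\beta > 0$.
$f(x) = \frac{\beta}{\pi(\beta^2 + (x-\alpha)^2)}$ for $-\infty < x < \infty$.
$F(x) = \frac{1}{2} + \frac{1}{\pi} \arctan\left(\frac{x-\alpha}{\beta}\right)$ for $-\infty < x < \infty$.
$\mathrm{E}[X]$ and $\mathrm{Var}(X)$ do not exist.

2. Exponential distribution: Exp($\lambda$), where $\lambda > 0$.
$f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$.
$F(x) = 1 - e^{-\lambda x}$ for $x \geq 0$.
$\mathrm{E}[X] = 1/\lambda$ and $\mathrm{Var}(X) = 1/\lambda^2$.

3. Gamma distribution: Gam($\alpha$, $\lambda$), where $\alpha > 0$ and $\lambda > 0$.
$f(x) = \frac{\lambda^\alpha x^{\alpha-1} e^{-\lambda x}}{\Gamma(\alpha)}$ for $x \geq 0$.
$F(a) = \int_0^a \frac{\lambda^\alpha x^{\alpha-1} e^{-\lambda x}}{\Gamma(\alpha)} \, dx$ for $a \geq 0$.
$\mathrm{E}[X] = \alpha/\lambda$ and $\mathrm{Var}(X) = \alpha/\lambda^2$.

4. Normal distribution: N($\mu$, $\sigma^2$), where $-\infty < \mu < \infty$ and $\sigma > 0$.
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$ for $-\infty < x < \infty$.
$F(a) = \int_{-\infty}^{a} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \, dx$ for $-\infty < a < \infty$.
$\mathrm{E}[X] = \mu$ and $\mathrm{Var}(X) = \sigma^2$.

5. Pareto distribution: Par($\alpha$), where $\alpha > 0$.
$f(x) = \frac{\alpha}{x^{\alpha+1}}$ for $x \geq 1$.
$F(x) = 1 - x^{-\alpha}$ for $x \geq 1$.
$\mathrm{E}[X] = \alpha/(\alpha-1)$ for $\alpha > 1$ and $\infty$ for $0 < \alpha \leq 1$.
$\mathrm{Var}(X) = \alpha/((\alpha-1)^2(\alpha-2))$ for $\alpha > 2$ and $\infty$ for $1 < \alpha \leq 2$.

6. Uniform distribution: U(a, b), where $a < b$.
$f(x) = \frac{1}{b-a}$ for $a \leq x \leq b$.
$F(x) = \frac{x-a}{b-a}$ for $a \leq x \leq b$.
$\mathrm{E}[X] = (a+b)/2$ and $\mathrm{Var}(X) = (b-a)^2/12$.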
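Entries of this summary can be spot-checked numerically. As a hedged illustration, the sketch below sums the Bin(n, p) probability mass function directly and compares the resulting mean and variance with np and np(1 - p); the parameter values n = 10 and p = 0.3 and all names are arbitrary choices made for this example.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X with a Bin(n, p) distribution."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 10, 0.3  # arbitrary illustrative parameters
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(f"sum of probabilities = {total:.6f}")
print(f"mean = {mean:.6f}, compared with np = {n * p:.6f}")
print(f"variance = {variance:.6f}, compared with np(1-p) = {n * p * (1 - p):.6f}")
```

The same pattern (sum or integrate the mass or density function) applies to the other distributions in the list whenever the expectation and variance exist.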


B 


Tables of the normal and t-distributions 




Table B.1. Right tail probabilities $1 - \Phi(a) = P(Z \geq a)$ for an N(0,1) distributed
random variable Z.


a 0 1 2 3 4 5 6 7 8 9
0.0 5000 4960 4920 4880 4840 4801 4761 4721 4681 4641 
0.1 4602 4562 4522 4483 4443 4404 4364 4325 4286 4247 
0.2 4207 4168 4129 4090 4052 4013 3974 3936 3897 3859 
0.3 3821 3783 3745 3707 3669 3632 3594 3557 3520 3483 
0.4 3446 3409 3372 3336 3300 3264 3228 3192 3156 3121 
0.5 3085 3050 3015 2981 2946 2912 2877 2843 2810 2776 
0.6 2743 2709 2676 2643 2611 2578 2546 2514 2483 2451 
0.7 2420 2389 2358 2327 2296 2266 2236 2206 2177 2148 
0.8 2119 2090 2061 2033 2005 1977 1949 1922 1894 1867 
0.9 1841 1814 1788 1762 1736 1711 1685 1660 1635 1611 
1.0 1587 1562 1539 1515 1492 1469 1446 1423 1401 1379 
1.1 1357 1335 1314 1292 1271 1251 1230 1210 1190 1170
1.2 1151 1131 1112 1093 1075 1056 1038 1020 1003 0985
1.3 0968 0951 0934 0918 0901 0885 0869 0853 0838 0823 
1.4 0808 0793 0778 0764 0749 0735 0721 0708 0694 0681 
1.5 0668 0655 0643 0630 0618 0606 0594 0582 0571 0559 
1.6 0548 0537 0526 0516 0505 0495 0485 0475 0465 0455 
1.7 0446 0436 0427 0418 0409 0401 0392 0384 0375 0367 
1.8 0359 0351 0344 0336 0329 0322 0314 0307 0301 0294 
1.9 0287 0281 0274 0268 0262 0256 0250 0244 0239 0233 
2.0 0228 0222 0217 0212 0207 0202 0197 0192 0188 0183 
2.1 0179 0174 0170 0166 0162 0158 0154 0150 0146 0143 
2.2 0139 0136 0132 0129 0125 0122 0119 0116 0113 0110 
2.3 0107 0104 0102 0099 0096 0094 0091 0089 0087 0084 
2.4 0082 0080 0078 0075 0073 0071 0069 0068 0066 0064 
2.5 0062 0060 0059 0057 0055 0054 0052 0051 0049 0048 
2.6 0047 0045 0044 0043 0041 0040 0039 0038 0037 0036 
2.7 0035 0034 0033 0032 0031 0030 0029 0028 0027 0026 
2.8 0026 0025 0024 0023 0023 0022 0021 0021 0020 0019 
2.9 0019 0018 0018 0017 0016 0016 0015 0015 0014 0014 
3.0 0013 0013 0013 0012 0012 0011 0011 0011 0010 0010 
3.1 0010 0009 0009 0009 0008 0008 0008 0008 0007 0007 
3.2 0007 0007 0006 0006 0006 0006 0006 0005 0005 0005 
3.3 0005 0005 0005 0004 0004 0004 0004 0004 0004 0003 


3.4 0003 0003 0003 0003 0003 0003 0003 0003 0003 0002 
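Entries of Table B.1 can be reproduced without a table. The sketch below is one way to do so, assuming only Python's standard library; the function name is chosen here for illustration. It uses the identity $1 - \Phi(a) = \frac{1}{2}\,\mathrm{erfc}(a/\sqrt{2})$.

```python
import math


def right_tail_probability(a):
    """1 - Phi(a) = P(Z >= a) for an N(0,1) distributed random variable Z."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))


# The row label gives a to one decimal, the column label the second decimal;
# for example, row 1.9 and column 6 correspond to a = 1.96.
for a in (0.00, 1.00, 1.96, 3.49):
    print(f"a = {a:.2f}: {right_tail_probability(a):.4f}")
```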




Table B.2. Right critical values $t_{m,p}$ of the t-distribution with m degrees of freedom
corresponding to right tail probability p: $P(T_m \geq t_{m,p}) = p$. The last row in the table
contains right critical values of the N(0,1) distribution: $t_{\infty,p} = z_p$.


                          Right tail probability p

m      0.1    0.05   0.025    0.01   0.005  0.0025   0.001  0.0005

1    3.078   6.314  12.706  31.821  63.657 127.321 318.309 636.619
2    1.886   2.920   4.303   6.965   9.925  14.089  22.327  31.599
3    1.638   2.353   3.182   4.541   5.841   7.453  10.215  12.924
4    1.533   2.132   2.776   3.747   4.604   5.598   7.173   8.610
5    1.476   2.015   2.571   3.365   4.032   4.773   5.893   6.869

6    1.440   1.943   2.447   3.143   3.707   4.317   5.208   5.959
7    1.415   1.895   2.365   2.998   3.499   4.029   4.785   5.408
8    1.397   1.860   2.306   2.896   3.355   3.833   4.501   5.041
9    1.383   1.833   2.262   2.821   3.250   3.690   4.297   4.781
10   1.372   1.812   2.228   2.764   3.169   3.581   4.144   4.587

11   1.363   1.796   2.201   2.718   3.106   3.497   4.025   4.437
12   1.356   1.782   2.179   2.681   3.055   3.428   3.930   4.318
13   1.350   1.771   2.160   2.650   3.012   3.372   3.852   4.221
14   1.345   1.761   2.145   2.624   2.977   3.326   3.787   4.140
15   1.341   1.753   2.131   2.602   2.947   3.286   3.733   4.073

16   1.337   1.746   2.120   2.583   2.921   3.252   3.686   4.015
17   1.333   1.740   2.110   2.567   2.898   3.222   3.646   3.965
18   1.330   1.734   2.101   2.552   2.878   3.197   3.610   3.922
19   1.328   1.729   2.093   2.539   2.861   3.174   3.579   3.883
20   1.325   1.725   2.086   2.528   2.845   3.153   3.552   3.850

21   1.323   1.721   2.080   2.518   2.831   3.135   3.527   3.819
22   1.321   1.717   2.074   2.508   2.819   3.119   3.505   3.792
23   1.319   1.714   2.069   2.500   2.807   3.104   3.485   3.768
24   1.318   1.711   2.064   2.492   2.797   3.091   3.467   3.745
25   1.316   1.708   2.060   2.485   2.787   3.078   3.450   3.725

26   1.315   1.706   2.056   2.479   2.779   3.067   3.435   3.707
27   1.314   1.703   2.052   2.473   2.771   3.057   3.421   3.690
28   1.313   1.701   2.048   2.467   2.763   3.047   3.408   3.674
29   1.311   1.699   2.045   2.462   2.756   3.038   3.396   3.659
30   1.310   1.697   2.042   2.457   2.750   3.030   3.385   3.646
40   1.303   1.684   2.021   2.423   2.704   2.971   3.307   3.551
50   1.299   1.676   2.009   2.403   2.678   2.937   3.261   3.496
∞    1.282   1.645   1.960   2.326   2.576   2.807   3.090   3.291

C 


Answers to selected exercises 


2.1 P(AUB) = 13/18. 3.4 P(B|T)=9.1- 107° and 
oA Wee. P(B|T*) =4.3- 107°. 
2.8 P(D,U D2) < 2- 1078 and 3.7b P(B) = 1/3. 
P(Di NM De) < 107%. 3.8a P(W) =0.117. 
2.11 p=(-1+V5)/2. 3.8b P(F|W) = 0.846. 
2.12a 1/10! 3.9 P(B|.A) =7/15. 
2.12b 5!-5! 3.14a P(W|R) =O0and P(W| R°) = 1. 
2.12¢ 8/63 = 12.7 percent. 3.14b P(W) = 2/3. 
2.14a —___ 3.16a P(D|T) = 0.165. 
a b c 3.16 b 0.795. 
a 0 1/6 1/6 A.la a 0 1 2 
b 0 0 1/3 pz(a) 25/36 10/36 1/36 
0 1/3 0 Z has a Bin(2,1/6) distribution. 
2.14b P({(a,b), (a,c)}) = 1/3. 4.1b {M =2,Z =0} = { (2,1), (1,2), 


(2, 2) }, {S=5,Z=1}=9, and 


2.14¢ P({(b,c), (c,b)}) = 2/3. one) aan See 


2.16 P(E) = 2/3. P(M = 2, Z=0) = 1/12, 

2.19a 0 = {2,3,4,...}. P(S =5,Z =1) =0, and 

2.19b 4p(1 —p)?. P2325) =1/16: 

3.1 7/36. 4.1c The events are dependent. 

3.2a P(A|B) =2/11. 43 a 0 1/2 3/4 

3.2b No. p(a) 1/3 1/6 1/2 

3.3a P(S;) = 13/52 = 1/4, 4.6a px(1) = px(3) = 1/27, px(4/3) = 
P(S2| Si) = 12/51, and px (8/3) = 3/27, px(5/3) = px(7/3) = 
P(S2 | Sf) = 13/51. 6/27, and px(2) = 7/27. 


3.3b P(S2) = 1/4. 4.6b 6/27. 




4.7a Bin(1000, 0.001). 


4.7b P(X =0) = 0.3677, P(X =1) = 
0.3681, and P(X > 2) = 0.0802. 


4.8a Bin(6,0.8178). 

4.8b 0.9999634. 

4.10a Determine P(R; = 0) first. 
4.10b No! 

4.10c See the birthday problem in Sec- 
tion 3.2. 

4.12 No! 

4.13a Geo(1/N). 

4.13b Let $D_i$ be the event that the
marked bolt was drawn (for the first
time) in the ith draw, and use conditional
probabilities in
$P(Y = k) = P(D_1^c \cap \cdots \cap D_{k-1}^c \cap D_k)$.
4.13c Count the number of ways the 
event {Z = k} can occur, and divide this 
by the number of ways ) we can select 
r objects from N objects. 

5.2 P(1/2< X < 3/4) =5/16. 

5.4a P(X < 41/2) = 1/4. 

5.4b P(X = 5)=1/2. 

5.4c X is neither discrete nor continu- 
ous! 

5.5a c=1. 

5.5b $F(x) = 0$ for $x < -3$;
$F(x) = (x+3)^2/2$ for $-3 \leq x \leq -2$;
$F(x) = 1/2$ for $-2 \leq x \leq 2$;
$F(x) = 1 - (3-x)^2/2$ for $2 \leq x \leq 3$;
$F(x) = 1$ for $x > 3$.

5.8a g(y) = 1/(2/7y). 

5.8b Yes. 

5.8c Consider F'(r/10). 

5.9a 1/2 and {(2,y):2<a%<3,1<y< 
3/2}. 

5.9b F(x) =0 for « <0; 

F(a) = 2a for0 <a < 1/2; 

F(x) =1 for x > 1/2. 

5.9c¢ f(x) =2for0< 2 < 1/2; 

f(a) = 0 elsewhere. 

5.12 2. 


5.13a Change variables from x to —2. 
5.13b P(Z < —2) = 0.0228. 

6.2a $1 + 2\sqrt{0.378\cdots} = 2.2300795$.
6.2 b Smaller. 

6.2¢ 0.3782739. 

6.5 Show, for $a \geq 0$, that $X \leq a$ is
equivalent with $U \geq e^{-a}$.


6.6 U=e 2%, 

60: GS/s, oe 
Z=.f-nUyp. 

6.9a 6/8. 


6.9b Geo(6/8). 


6.10a Define B; = 1 if Ui; < p and 
B; = 0 if U; > p, and N as the posi- 
tion in the sequence of B; where the first 
1 occurs. 


6.10b P(Z>n) = (1— p)”, for n = 
0,1,...; Z has a Geo(p) distribution. 
7.1a Outcomes: 1, 2, 3, 4,5, and 6. Each 
has probability 1/6. 
7.1b E[T] =7/2, Var(T) = 35/12. 
7.2a E[X]=1/5. 
7.2b y 0 1 

P(Y =y) 2/5 3/5 
and E[Y] = 3/5. 
7.2c $\mathrm{E}[X^2] = 3/5$.
7.2d Var(X) = 14/25. 
7.5 E[X] =p and Var(X 
7.6 195/76. 
7.8 E[X]=1/3. 
7.10a $\mathrm{E}[X] = 1/\lambda$ and $\mathrm{E}[X^2] = 2/\lambda^2$.
7.10b $\mathrm{Var}(X) = 1/\lambda^2$.
7.11a 2.
7.11b The expectation is infinite! 
Tile Bx] => eae de. 
7.15a Start with 
$\mathrm{Var}(rX) = \mathrm{E}[(rX - \mathrm{E}[rX])^2]$.
7.15b Start with $\mathrm{Var}(X + s) =
\mathrm{E}[((X + s) - \mathrm{E}[X + s])^2]$.
7.15c Apply b with rX instead of X. 
7.16 E[X] = 4/9. 


) =p(1—p). 


7.17a If positive terms add to 
they must all be zero. 


7.17b Note that 

E[(V — E[V])?] = Var(V). 

8.1 y 0 10 20 

0.2 0.4 0.4 
-1 0 at 
1/6 1/2 1/3 
-1 0 1 

P(Z=z) 1/3 1/2 1/6 
8.2c P(W=1)=1. 
8.3a V has a U(7,9) distribution. 


8.3b rU +s has a U(s,s +r) distribu- 
tion ifr > 0 and a U(s+r,s) distribution 
ifr <0. 


8.5a 2°(3—2)/4 for0O< a <2. 


8.5b Fy(y) = (3/4)y* — (1/4)y° for 0 < 
y < v2. 


8.5 3y® — (3/2)y° for 0 <y < v2, 
0 elsewhere. 


8.8 Fw(w)=1—e°™ , with y=A*. 
8.10 0.1587. 
8.11 Apply Jensen with —g. 
8.12a y 0 1 10 100 

PY =o) Ga. @ 
8.12b VEIX] = E[ VX]. 
8.12¢ \/B[X] = 50.25, but E [vx] = 
27.75. 


8.18 V has an exponential distribution 
with parameter ni. 


Zero, 


8.19a The upper right quarter of the 
circle. 


8.19b Fz(t) = 1/24 arctan(t)/z. 
8.19¢ 1/[x(1 + 27)). 

9.2a P(X =0,Y =-1)=1/6, 
P(X =0,Y =1) =0, 

P(X =1,Y =-1) = 1/6, 

P(X =2,Y =-1)=1/6, 

and P(X =2,Y =1)=0. 




9.2b Dependent. 
9.5a 1/16<7< 1/4. 
9.5b No. 
9.6a 
U 
v 0 1 2 
0 1/4 0 1/4 1/2 
1 0 1/2 0 1/2 
1/4 1/2 
9.6b Dependent. 
9.8a z 0 1 


i) 
ow 


pz(z) ¢ 
9.8b Zz —2 


Ale 
Ale 
SO alr 


=! 1 2 3 


1 a2 22 2 21 


px) 3 § T7233 
9.9a Fx(x) = 1—e-** for > 0 and 
Fy(y)=1-—e% fory>0. 
9.9b f(x,y) = 2e°?**™ for > 0 and 
y>0. 


9.9c fx(x) = 2e 
e ’ for y>0. 


9.9d Independent. 

9.10a 41/720. 

9.10b F(a,b) = 2a7b? + 20°b*. 
9.10c Fx(a) =a? 

9.10d fx(x) = 2a forO<a<1. 
9.10e Independent. 

9.11 27/50. 

9.13a 1/7. 

9.13b Fr(r) =r? forO<r<1. 
9.13¢ fx(x) = 2V1— 2? = fy(a) for 
x between —1 and 1. 


9.15a Since F(a,b) = “SSS 
where L(a,b) is the set of points (zx, a 
for which x < a and y < b, one needs to 
calculate the areas for the various cases. 
9.15b f(x,y) = 2 for (x,y) € A, and 
f(x,y) = 0 otherwise. 

9.15c Use the rule on page 122. 


9.19a a=5V2, b=4V2, and c= 18. 


2" + > O and fy(y) = 


area (AnLI(a, 8) 




9.19b Use that — Lane 3 (HH) is 
the probability density function of an 
N(u, 07) distributed random variable. 


9.19¢ N(0,1/36). 
10.1a Cov(X, Y) = 0.142. Positively
correlated.
10.1b $\rho(X, Y) = 0.0503$.
10.2a E[XY] = 0.
10.2b Cov(X, Y) = 0.
10.2c Var(X + Y) = 4/3.
10.2d Var(X − Y) = 4/3.


10.5a 
a 
b 0 1 2 
0 8/72 6/72 10/72 1/3 
1 12/72 9/72 15/72 1/2 
2 4/72 3/72 5/72 1/6 
1/3 1/4 5/12 1 


10.5b E[X] = 13/12, E[Y] = 5/6, and 
Cov(X,Y) = 0. 

10.5c Yes. 

10.6a E[X] = E[Y] =0 and 
Cov(X,Y) =0. 

10.6b E[X] =E[Y] =c; E[XY] =’. 
10.6c No. 

10.7a Cov(X,Y) = -1/8. 

10.7b p(X,Y) = —-1/2. 

10.7 ¢ For € equal to 1/4, 0 or —1/4. 
10.9a $P(X_i = 1) = (1 - 0.001)^{40} = 0.96$
and $P(X_i = 41) = 0.04$.

10.9b $\mathrm{E}[X_i] = 2.6$ and
$\mathrm{E}[X_1 + \cdots + X_{25}] = 65$.

10.10a E[X] = 109/50, 

E[Y] = 157/100, and E[X + Y] = 15/4. 
10.10b $\mathrm{E}[X^2] = 1287/250$,

E[Y?] = 318/125, and 

E[X + Y] = 3633/250. 

10.10¢ Var(X) = 989/2500, 

Var(Y) = 791/10 000, and 

Var(X + Y) = 4747/10 000. 


10.14a Use the alternative expression 
for the covariance. 


10.14b Use the alternative expression 
for the covariance. 

10.14c Combine parts a and b. 
10.16a Var(X) + Cov(X,Y). 

10.16 b Anything can happen. 

10.16c X and X+Y are positively cor- 
related. 

10.18 Solve 0 = N(N—1)(N+1)/124+ 
N(N — 1)Cov(X1, X2). 


11.1a Check that for k between 2 and 6, 
the summation runs over = 1,...,k—1, 
whereas for k between 7 and 12 it runs 
over (= k—6,...,12. 


11.1 b Check that for 2 << k < N, the 
summation runs over & = 1,...,k —1, 
whereas for k between N + 1 and 2N it 
runs over £=k—N,...,2N. 


11.2a Check that the summation runs 
over = 0,1,...,k. 


11.2b Use that A* fy’ /(A+y)* is equal 
to eal —p)*“, with p = w/(A+ yp). 
11.4a E[Z] = —3 and Var(Z) = 81. 
11.4b Z has an N(—3,81) distribution. 
11.4¢ P(Z <6) =0.8413. 


11.5 Check that for 0 < z < 1, the in- 
tegral runs over 0 < y < z, whereas for 
1<z< 2, it runs over z—-1l<y<l. 


11.6 Check that the integral runs over 
O<sy<z. 


11.7 Recall that a Gam(k, A) random 
variable can be represented as the sum of 
k independent Exp (A) random variables. 


11.9a fz(z)= 3(4 = =) fo 2 > 1, 
ai 1 1 

11.96 fal) = 5g (ae - ze) 

for z> 1. 


12.1e 1: no, 2: no, 3: okay, 4: okay, 5: 
okay. 


12.5a 0.00049. 


12.5b 1 (correct to 8281 decimals). 
12.6 0.256. 

12.7a \ 0.192. 

12.7b 0.1583 is close to 0.147. 

12-7 @ 2710. 

12.8a $\mathrm{E}[X(X-1)] = \mu^2$.

12.8b $\mathrm{Var}(X) = \mu$.

12.11 The probability of the event in 
the hint equals (As)"e~*?8 /(k!(n — k)!). 
12.14a Note: 1—1/n — land1/n— 0. 
12.14b E[X,] = (1—1/n)-0+4+ (1/n)- 
in = 7. 

13.2a E[X;] = 0 and Var(X;) = 1/12. 
13.2b 1/12. 

13.4a n> 63. 

13.4b n> 250. 

13.4c¢ n> 125. 

13.4d n> 240. 


13.6 Expected income per game € 1/37; 
per year: € 9865. 


13.8a Var(Yn/2h) = 0.171/h/n. 
13.8b n> 801. 


13.9a T), is the average of a sequence of 
independent and identically distributed 
random variables. 


13.9b a=E[X?] = 1/3. 


13.10a $P(|M_n - 1| > \varepsilon) = (1 - \varepsilon)^n$ for
$0 \leq \varepsilon \leq 1$.


13.10 b No. 
14.2 0.9977. 
14.3 17. 
14.4 1/2. 


14.5 Use that X has the same probabil- 
ity distribution as X; + Xe +---+ Xn, 
where Xj, X2,...,Xn are independent 
Ber(p) distributed random variables. 
14.6a P(X < 25) = 0.5, P(X < 26) = 
0.6141. 

14.6b P(X < 2) 20. 

14.9a 5.71%. 

14.9b Yes! 




14.10a 91. 


14.10b Use that (M, — c)/o has an 
N(0,1) distribution. 
15.3a
Bin            Height
(0,250]        0.00297
(250,500]      0.00067
(500,750]      0.00015
(750,1000]     0.00008
(1000,1250]    0.00002
(1250,1500]    0.00004
(1500,1750]    0.00004
(1750,2000]    0
(2000,2250]    0
(2250,2500]    0.00002


15.3b Skewed. 




15.4a 


Bin            Height
[0,500]        0.0012741
(500,1000]     0.0003556
(1000,1500]    0.0001778
(1500,2000]    0.0000741
(2000,2500]    0.0000148
(2500,3000]    0.0000148
(3000,3500]    0.0000296
(3500,4000]    0
(4000,4500]    0.0000148
(4500,5000]    0
(5000,5500]    0.0000148
(5500,6000]    0.0000148
(6000,6500]    0.0000148

15.4b 
t $F_n(t)$ t $F_n(t)$
0 0 3500 0.9704 
500 0.6370 4000 0.9704 
1000 0.8148 4500 0.9778 
1500 0.9037 5000 0.9778 
2000 0.9407 5500 0.9852 
2500 0.9481 6000 0.9926 
3000 0.9556 6500 1 


15.4c Both are equal to 0.0889. 


15.5 
Bin Height 
(0,1] 0.2250 
(1,3] 0.1100 
(3,5] 0.0850 
(5,8] 0.0400 
(8,11) 0.0230 
(11,14] 0.0350 
(14, 18] 0.0225 


15.6 F,(7) =0.9. 


15.11 Use that the number of 2; in 
(a, b] equals the number of x; < b minus 
the number of x; <a. 


15.12a Bring the integral into the sum, 
change the integration variable to u = 
(t — x;)/h, and use the properties of ker- 
nel functions. 

15.12 b Similar to a. 

16.1a Median: 290.


16.1b Lower quartile: 81; upper quar- 
tile: 843; IQR: 762. 


16.1c 144.6. 


16.3a Median: 70; lower quartile: 66.25; 
upper quartile: 75. 


16.3b 


16.3c Note the position of 31 in the 
boxplot. 


16.4a Yes, they both equal 7.056. 
16.4b Yes. 

16.4c Yes. 

16.6a Yes. 

16.6 b In general this will not be true. 
16.6c Yes. 

16.8 MAD is 3. 


16.10a The sample mean goes to infin- 
ity, whereas the sample median changes 
to 4.6. 


16.10b At least three elements need to 
be replaced. 


16.10c For the sample mean only one; 
for the sample median at least |(n+1)/2| 
elements. 


16.12 Z, = (N+ 1)/2; Med, = (N+ 
1)/2. 
16.15 Write (a; — En)? = x7 — 2px t 


=2 
Xn: 


17.1 
N(3,1) N(0,1) —-N(0,1) 
N(3,1) Exp(1/3) Exp (1) 
N(0,1) N(0,9) — Exp(1) 
N(3,1) N(0,9)  Exp(1/3) 
N(0,9) Exp(1/3)  Exp(1) 


17.2 
Exp(1/3) N(0,9) Exp(1/3) 
N(0,1) N(3,1) Exp(1) 
N(0,9) N(0,9) N(3,1)
Exp(1) N(3,1) Exp(1) 
N(0,1) N(0,1) Exp(1/3) 


17.3a Bin(10,p). 

17.3b p= 0.435. 

17.5a One possibility is p = 93/331; an- 
other is p = 29/93. 

17.5b p = 474/1285 or p = 198/474. 


17.5c 0.6281 or 0.6741 for smokers and 
0.7486 or 0.8026 for nonsmokers. 


17.7a An exponential distribution. 
17.7b One possibility is $\lambda = 0.00469$.


17.9a Recall the formula for the vol- 
ume of a cylinder with diameter d (at 
the base) and height h. 


17.9b $\bar{z}_n = 0.3022$; $\bar{y}/\bar{x} = 0.3028$; least squares: 0.3035.


18.1 $5^6 = 15625$. Not equally likely.
18.3a 0.0574. 

18.3b 0.0547. 

18.3c 0.000029. 

18.4a 0.3487. 

18.4b $(1 - 1/n)^n$.

18.5 values $0$, $\pm 1$, $\pm 2$, and $\pm 3$ with
probabilities 7/27, 6/27, 3/27, and 1/27.
18.7 Determine from which parametric distribution you generate the bootstrap
datasets and what the bootstrapped version is of $\bar{X}_n - \mu$.


18.8a Determine from which $F$ you generate the bootstrap datasets and
what the bootstrapped version is of $\bar{X}_n - \mu$.

18.8b Similar to a. 

18.8c Similar to a and b. 

18.9 Determine which normal distribution corresponds to $X_1^*, X_2^*, \ldots, X_n^*$ and
use this to compute $\mathrm{P}(|\bar{X}_n^* - \mu^*| > 1)$.




19.1a First show that $\mathrm{E}[X_i^2] = \theta^2/3$, and use linearity of expectations.
19.1b $\sqrt{T}$ has negative bias.

19.3 a=1/n, b=0. 

19.5 c=n. 


19.6a Use linearity of expectations and plug in the expressions for $\mathrm{E}[M_n]$ and
$\mathrm{E}[\bar{X}_n]$.


19.6b $(nM_n - \bar{X}_n)/(n - 1)$.
19.6c Estimate for $\delta$: 2073.5.


19.8 Check that $\mathrm{E}[Y_i] = \beta x_i$ and use linearity of expectations.


20.2a We prefer T. 


20.2b If a < 6 we prefer T; if a > 6 we 
prefer S. 


20.3 $T_1$.

20.4a $\mathrm{E}[3L - 1] = 3\mathrm{E}[N + 1 - M] - 1 = N$.
20.4b (N+1)(N — 2)/2. 

20.4c 4 times. 


20.7 $\mathrm{Var}(T_1) = (4 - \theta^2)/n$ and
$\mathrm{Var}(T_2) = \theta(4 - \theta)/n$. We prefer $T_2$.


20.8a Use linearity of expectations. 
20.8b Differentiate with respect to r. 


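20.11 $\mathrm{MSE}(T_1) = \sigma^2 / \left(\sum_{i=1}^n x_i^2\right)$,
$\mathrm{MSE}(T_2) = (\sigma^2/n^2) \cdot \sum_{i=1}^n (1/x_i^2)$,
$\mathrm{MSE}(T_3) = \sigma^2 n / \left(\sum_{i=1}^n x_i\right)^2$.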
21.1 Dao. 
21.2 p=1/4. 


21.4a Use that $X_1, \ldots, X_n$ are independent $Pois(\mu)$ distributed random
variables.


21.4b $\ell(\mu) = \left(\sum_{i=1}^n x_i\right) \ln(\mu) - \ln(x_1! \cdot x_2! \cdots x_n!) - n\mu$, $\hat{\mu} = \bar{x}_n$.


21.4c ee, 
21.5a $\bar{x}_n$.


21.8a $L(\theta) = \frac{C}{4^{3839}} \cdot (2+\theta)^{1997} \cdot \theta^{32} \cdot (1-\theta)^{1810}$;
$\ell(\theta) = \ln(C) - 3839\ln(4) + 1997\ln(2+\theta) + 32\ln(\theta) + 1810\ln(1-\theta)$.
21.8b 0.0357. 
21.8c $(-b + \sqrt{D})/(2n)$, with $b = -n_1 + n_2 + 2n_3 + 2n_4$,
and $D = (n_1 - n_2 - 2n_3 - 2n_4)^2 + 8nn_2$.


21.9 $\hat{\alpha} = x_{(1)}$ and $\hat{\beta} = x_{(n)}$.

21.11a $1/\bar{x}_n$.

21.11b $y_{(n)}$.

22.1a $\hat{\alpha} = 2.35$, $\hat{\beta} = -0.25$.

22.1b $r_1 = -0.1$, $r_2 = 0.2$, $r_3 = -0.1$.
22.1c The estimated regression line 
goes through (0, 2.35) and (3, 1.6). 

22.5 Minimize $\sum_{i=1}^n (y_i - \beta x_i)^2$.

22.6 2218.45. 

22.8 The model with no intercept. 
22.10a $\hat{\alpha} = 7/3$, $\hat{\beta} = -1$, $A(\hat{\alpha}, \hat{\beta}) = 4/3$.

22.10b $17/9 < \alpha < 7/3$, $\alpha = 2$.
22.10c $\alpha = 2$, $\beta = -1$.


22.12a Use that the denominator of $\hat{\beta}$ and that $\sum x_i$ are numbers, not random
variables.


22.12b Use that $\mathrm{E}[Y_i] = \alpha + \beta x_i$.
22.12c Simplify the expression in b. 
22.12d Combine a and c. 

23.1 (740.55, 745.45). 

23.2 (3.486, 3.594). 

23.5a (0.050, 1.590). 

23.5 b See Section 23.3. 

23.5c (0.045, 1.600). 


23.6a Rewrite the probability in terms of $L_n$ and $U_n$.


23.6b $(3l_n + 7, 3u_n + 7)$.


23.6c $\tilde{L}_n = 1 - U_n$ and $\tilde{U}_n = 1 - L_n$.
The confidence interval: $(-4, 3)$.


23.6d $(0, 25)$ is a conservative 95% confidence interval for $\theta$.


23.7 $(e^{-3}, e^{-2}) = (0.050, 0.135)$.
23.11a Yes.


23.11 b Not necessarily. 

23.11c Not necessarily.

24.1 (0.620, 0.769). 

24.4a 609. 

24.4b No. 

24.6a $(1.68, \infty)$.

24.6 b [0, 2.80). 

24.8a (0.449, 0.812). 

24.8b (0.481, 1]. 

24.9a See Section 8.4. 

24.9b $c_l = 0.779$, $c_u = 0.996$.

24.9c $(3.013, 3.851)$.

24.9d $\left(m/(1 - \alpha/2)^{1/n}, \; m/(\alpha/2)^{1/n}\right)$.
25.2 $H_1: \mu > 1472$.

25.4a The difference or the ratio of the 
average numbers of cycles for the two 
groups. 

25.4b The difference or the ratio of the maximum likelihood estimators $\hat{p}_1$
and $\hat{p}_2$.

25.4c $H_1: p_1 < p_2$.

25.5a Relevant values of $T_1$ are in $[0, 5]$; those close to 0, or close to 5, are in favor
of $H_1$.


25.5b Relevant values of $T_2$ are in $[0, 5]$;
only those close to 0 are in favor of $H_1$.


25.6a The p-value is 0.23. Do not reject. 


25.6b The p-value is 0.77. Do not re- 
ject. 


25.6c The p-value is 0.968. Do not re- 
ject. 


25.6d The p-value is 0.019. Reject. 
25.6e The p-value is 0.99. Do not reject. 


25.6f The p-value is smaller than 0.019. 
Reject. 


25.6g The p-value is smaller than 0.200. We cannot say anything about
rejection of $H_0$.


25.10a $H_1: \mu > 23.75$.
25.10b The p-value is 0.0344. 
25.11 0.0456. 


26.3a 0.1. 
26.3b 0.72. 


26.5a The p-value is 0.1050. Do not reject $H_0$; this agrees with Exercise 24.8b.


26.5b K = {16,17,...,23}. 

26.5c 0.0466. 

26.5d 0.6950. 

26.6a Right critical value. 

26.6b Right critical value $c = 1535.1$; critical region $[1536, \infty)$.

26.8a For $T$ we find $K = (0, c_l]$ and for $T'$ we find $K' = [c'_u, 1)$.

26.8b For $T$ we find $K = (0, c_l] \cup [c_u, \infty)$ and for $T'$ we find
$K' = (0, c'_l] \cup [c'_u, 1)$.

26.9a For $T$ we find $K = [c_u, \infty)$ and for $T'$ we find
$K' = [c'_l, 0) \cup (0, c'_u]$.
26.9b For $T$ we find $K = [c_u, \infty)$ and for $T'$ we find $K' = (0, c'_u]$.
27.2a $H_0: \mu = 2550$ and $H_1: \mu \neq 2550$.

27.2b $t = 1.2096$. Do not reject $H_0$.




27.5a $H_0: \mu = 0$; $H_1: \mu > 0$; $t = 0.70$.


27.5b p-value: 0.2420. Do not reject $H_0$.


27.7a $H_0: \beta = 0$ and $H_1: \beta < 0$; $t_b = -20.06$. Reject $H_0$.


27.7b Same testing problem; $t_b = -11.03$. Reject $H_0$.


28.1a $H_0: \mu_1 = \mu_2$ and $H_1: \mu_1 \neq \mu_2$; $t_p = -2.130$. Reject $H_0$.


28.1b $H_0: \mu_1 = \mu_2$ and $H_1: \mu_1 \neq \mu_2$; $t_d = -2.130$. Reject $H_0$.


28.1c Reject $H_0$. The salaries differ significantly.


28.3a $t_p = 2.492$. Reject $H_0$.
28.3b Reject $H_0$.
28.3c $t_d = 2.463$. Reject $H_0$.
28.3d Reject $H_0$.


28.5a Determine $\mathrm{E}[aS_X^2 + bS_Y^2]$, using that $S_X^2$ and $S_Y^2$ are both unbiased for $\sigma^2$.
28.5b Determine $\mathrm{Var}(aS_X^2 + (1-a)S_Y^2)$, using that $S_X^2$ and $S_Y^2$ are independent,
and minimize over $a$.


D 


Full solutions to selected exercises 


2.8 From the rule for the probability of a union we obtain
$\mathrm{P}(D_1 \cup D_2) \le \mathrm{P}(D_1) + \mathrm{P}(D_2) = 2 \cdot 10^{-6}$. Since $D_1 \cap D_2$ is contained in both
$D_1$ and $D_2$, we obtain $\mathrm{P}(D_1 \cap D_2) \le \min\{\mathrm{P}(D_1), \mathrm{P}(D_2)\} = 10^{-6}$. Equality may hold
in both cases: for the union, take $D_1$ and $D_2$ disjoint; for the intersection, take $D_1$ and
$D_2$ equal to each other.


2.12a This is the same situation as with the three envelopes on the doormat, but 
now with ten possibilities. Hence an outcome has probability 1/10! to occur. 


2.12b For the five envelopes labeled 1,2,3,4,5 there are 5! possible orders, and 
for each of these there are 5! possible orders for the envelopes labeled 6, 7,8, 9, 10. 
Hence in total there are 5! - 5! outcomes. 


2.12c There are $32 \cdot 5! \cdot 5!$ outcomes in the event “dream draw.” Hence the probability
is $32 \cdot 5! \cdot 5!/10! = 32 \cdot 1 \cdot 2 \cdot 3 \cdot 4 \cdot 5/(6 \cdot 7 \cdot 8 \cdot 9 \cdot 10) = 8/63 = 12.7$ percent.


2.14a Since door a is never opened, P((a,a)) = P((b,a)) = P((c,a)) = 0. If the can- 
didate chooses a (which happens with probability 1/3), then the quizmaster chooses 
without preference from doors b and c. This yields that P((a,b)) = P((a,c)) = 1/6. 
If the candidate chooses b (which happens with probability 1/3), then the quizmas- 
ter can only open door c. Hence P((b,c)) = 1/3. Similarly, P((c, b)) = 1/3. Clearly, 
P((b, b)) = P((c,c)) =0. 

2.14b If the candidate chooses a then she or he wins; hence the corresponding 
event is {(a, a), (a,b), (a,c)}, and its probability is 1/3. 


2.14c To end with a the candidate should have chosen b or c. So the event is 
{(b,c), (c,b)} and P({(b, c), (c, b)}) = 2/3. 

2.16 Since $E \cap F \cap G = \emptyset$, the three sets $E \cap F$, $F \cap G$, and $E \cap G$ are disjoint.
Since each has probability 1/3, they have probability 1 together. From these two
facts one deduces $\mathrm{P}(E) = \mathrm{P}(E \cap F) + \mathrm{P}(E \cap G) = 2/3$ (make a diagram or use that
$E = E \cap (E \cap F) \cup E \cap (F \cap G) \cup E \cap (E \cap G)$).

3.1 Define the following events: $B$ is the event “point B is reached on the second
step,” $C$ is the event “the path to C is chosen on the first step,” and similarly we
define $D$ and $E$. Note that the events $C$, $D$, and $E$ are mutually exclusive and that
one of them must occur. Furthermore, we can only reach B by first going to C




or D. For the computation we use the law of total probability, by conditioning on 
the result of the first step: 


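$\mathrm{P}(B) = \mathrm{P}(B \cap C) + \mathrm{P}(B \cap D) + \mathrm{P}(B \cap E)$
$= \mathrm{P}(B \mid C)\,\mathrm{P}(C) + \mathrm{P}(B \mid D)\,\mathrm{P}(D) + \mathrm{P}(B \mid E)\,\mathrm{P}(E)$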
ME Voy de de pe ot 
=3°3¢a°3 +3 °= oe 
3.2a Event $A$ has three outcomes, event $B$ has 11 outcomes, and $A \cap B =
\{(1,3), (3,1)\}$. Hence we find $\mathrm{P}(B) = 11/36$ and $\mathrm{P}(A \cap B) = 2/36$ so that

$$\mathrm{P}(A \mid B) = \frac{\mathrm{P}(A \cap B)}{\mathrm{P}(B)} = \frac{2/36}{11/36} = \frac{2}{11}.$$

3.2b Because $\mathrm{P}(A) = 3/36 = 1/12$ and this is not equal to $2/11 = \mathrm{P}(A \mid B)$, the
events $A$ and $B$ are dependent.


3.3a There are 13 spades in the deck and each has probability 1/52 of being chosen,
hence $\mathrm{P}(S_1) = 13/52 = 1/4$. Given that the first card is a spade there are $13 - 1 = 12$
spades left in the deck with $52 - 1 = 51$ remaining cards, so $\mathrm{P}(S_2 \mid S_1) = 12/51$. If
the first card is not a spade there are 13 spades left in the deck of 51, so
$\mathrm{P}(S_2 \mid S_1^c) = 13/51$.


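3.3b We use the law of total probability (based on $\Omega = S_1 \cup S_1^c$):

$$\mathrm{P}(S_2) = \mathrm{P}(S_2 \cap S_1) + \mathrm{P}(S_2 \cap S_1^c) = \mathrm{P}(S_2 \mid S_1)\,\mathrm{P}(S_1) + \mathrm{P}(S_2 \mid S_1^c)\,\mathrm{P}(S_1^c)
= \frac{12}{51} \cdot \frac{1}{4} + \frac{13}{51} \cdot \frac{3}{4} = \frac{12 + 39}{51 \cdot 4} = \frac{1}{4}.$$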
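
The arithmetic in 3.3b can be checked with exact fractions. The following is a minimal sketch; the variable names are ours and only the numbers come from 3.3a.

```python
from fractions import Fraction

# Probabilities taken from the solution to 3.3a (52 cards, 13 spades).
p_s1 = Fraction(13, 52)               # P(S1): first card is a spade
p_s2_given_s1 = Fraction(12, 51)      # P(S2 | S1)
p_s2_given_not_s1 = Fraction(13, 51)  # P(S2 | S1^c)

# Law of total probability, conditioning on the first card.
p_s2 = p_s2_given_s1 * p_s1 + p_s2_given_not_s1 * (1 - p_s1)
print(p_s2)
```

Working with `Fraction` avoids rounding, so the result can be compared directly with the value 1/4 found above.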
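3.7a The best approach to a problem like this one is to write out the conditional
probability and then see if we can somehow combine this with $\mathrm{P}(A) = 1/3$ to
solve the puzzle. Note that $\mathrm{P}(B \cap A^c) = \mathrm{P}(B \mid A^c)\,\mathrm{P}(A^c)$ and that $\mathrm{P}(A \cup B) =
\mathrm{P}(A) + \mathrm{P}(B \cap A^c)$. So

$$\mathrm{P}(A \cup B) = \frac{1}{3} + \frac{1}{4} \cdot \left(1 - \frac{1}{3}\right) = \frac{1}{3} + \frac{1}{6} = \frac{1}{2}.$$

3.7b From the conditional probability we find $\mathrm{P}(A^c \cap B^c) = \mathrm{P}(A^c \mid B^c)\,\mathrm{P}(B^c) =
\frac{1}{2}(1 - \mathrm{P}(B))$. Recalling DeMorgan’s law we know $\mathrm{P}(A^c \cap B^c) = \mathrm{P}((A \cup B)^c) =
1 - \mathrm{P}(A \cup B) = 1/3$. Combined this yields an equation for $\mathrm{P}(B)$: $\frac{1}{2}(1 - \mathrm{P}(B)) = 1/3$,
from which we find $\mathrm{P}(B) = 1/3$.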

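3.8a This asks for $\mathrm{P}(W)$. We use the law of total probability, decomposing $\Omega =
F \cup F^c$. Note that $\mathrm{P}(W \mid F) = 0.99$.

$$\mathrm{P}(W) = \mathrm{P}(W \cap F) + \mathrm{P}(W \cap F^c) = \mathrm{P}(W \mid F)\,\mathrm{P}(F) + \mathrm{P}(W \mid F^c)\,\mathrm{P}(F^c)
= 0.99 \cdot 0.1 + 0.02 \cdot 0.9 = 0.099 + 0.018 = 0.117.$$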


3.8b We need to determine $\mathrm{P}(F \mid W)$, and this can be done using Bayes’ rule. Some
of the necessary computations have already been done in a; we can copy $\mathrm{P}(W \cap F)$
and $\mathrm{P}(W)$ and get:

$$\mathrm{P}(F \mid W) = \frac{\mathrm{P}(F \cap W)}{\mathrm{P}(W)} = \frac{0.099}{0.117} = 0.846.$$
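
The two steps of 3.8 (total probability, then Bayes' rule) can be sketched in a few lines. The function name `posterior` and its parameter names are our own choice; only the three input probabilities come from the solution.

```python
def posterior(prior, p_pos_given_true, p_pos_given_false):
    """Return (P(W), P(F | W)) for an event F with P(F) = prior.

    p_pos_given_true is P(W | F), p_pos_given_false is P(W | F^c).
    """
    # Law of total probability for P(W).
    p_w = p_pos_given_true * prior + p_pos_given_false * (1 - prior)
    # Bayes' rule: P(F | W) = P(W | F) P(F) / P(W).
    return p_w, p_pos_given_true * prior / p_w


# Numbers used in 3.8: P(F) = 0.1, P(W | F) = 0.99, P(W | F^c) = 0.02.
p_w, p_f_given_w = posterior(0.1, 0.99, 0.02)
print(round(p_w, 3), round(p_f_given_w, 3))
```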




4.1a In two independent throws of a die there are 36 possible outcomes, each 
occurring with probability 1/36. Since there are 25 ways to have no 6’s, 10 ways to 
have one 6, and one way to have two 6’s, we find that pz(0) = 25/36, pz(1) = 10/36, 
and pz(2) = 1/36. So the probability mass function pz of Z is given by the following 
table: 

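   z          0        1        2
   p_Z(z)     25/36    10/36    1/36


The distribution function $F_Z$ is given by

$$F_Z(a) = \begin{cases}
0 & \text{for } a < 0, \\
\frac{25}{36} & \text{for } 0 \le a < 1, \\
\frac{25}{36} + \frac{10}{36} = \frac{35}{36} & \text{for } 1 \le a < 2, \\
\frac{25}{36} + \frac{10}{36} + \frac{1}{36} = 1 & \text{for } a \ge 2.
\end{cases}$$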


Z is the sum of two independent Ber(1/6) distributed random variables, so Z has 
a Bin(2,1/6) distribution. 

4.1b If we denote the outcome of the two throws by $(i, j)$, where $i$ is the outcome
of the first throw and $j$ the outcome of the second, then $\{M = 2, Z = 0\} =
\{(2,1), (1,2), (2,2)\}$, $\{S = 5, Z = 1\} = \emptyset$, $\{S = 8, Z = 1\} = \{(6,2), (2,6)\}$. Furthermore,
$\mathrm{P}(M = 2, Z = 0) = 3/36$, $\mathrm{P}(S = 5, Z = 1) = 0$, and $\mathrm{P}(S = 8, Z = 1) = 2/36$.


4.1c The events are dependent, because, e.g., $\mathrm{P}(M = 2, Z = 0) = \frac{3}{36}$ differs from
$\mathrm{P}(M = 2) \cdot \mathrm{P}(Z = 0) = \frac{3}{36} \cdot \frac{25}{36}$.
4.10a Each $R_i$ has a Bernoulli distribution, because it can only attain the values 0
and 1. The parameter is $p = \mathrm{P}(R_i = 1)$. It is not easy to determine $\mathrm{P}(R_i = 1)$, but
it is fairly easy to determine $\mathrm{P}(R_i = 0)$. The event $\{R_i = 0\}$ occurs when none of
the $m$ people has chosen the $i$th floor. Since they make their choices independently
of each other, and each floor is selected by each of these $m$ people with probability
1/21, it follows that

$$\mathrm{P}(R_i = 0) = \left(\frac{20}{21}\right)^m.$$

Now use that $p = \mathrm{P}(R_i = 1) = 1 - \mathrm{P}(R_i = 0)$ to find the desired answer.


4.10b If $\{R_1 = 0\}, \ldots, \{R_{20} = 0\}$, we must have that $\{R_{21} = 1\}$, so we cannot
conclude that the events $\{R_1 = a_1\}, \ldots, \{R_{21} = a_{21}\}$, where $a_i$ is 0 or 1, are independent.
Consequently, we cannot use the argument from Section 4.3 to conclude that
$S_m$ is $Bin(21, p)$. In fact, $S_m$ is not $Bin(21, p)$ distributed, as the following shows.
The elevator will stop at least once, so $\mathrm{P}(S_m = 0) = 0$. However, if $S_m$ had
a $Bin(21, p)$ distribution, then $\mathrm{P}(S_m = 0) = (1 - p)^{21} > 0$, which is a contradiction.


4.10c This exercise is a variation on finding the probability of no coincident birthdays
from Section 3.2. For $m = 2$, $S_2 = 1$ occurs precisely if the two persons entering
the elevator select the same floor. The first person selects any of the 21 floors, the
second selects the same floor with probability 1/21, so $\mathrm{P}(S_2 = 1) = 1/21$. For $m = 3$,
$S_3 = 1$ occurs if the second and third persons entering the elevator both select the
same floor as was selected by the first person, so $\mathrm{P}(S_3 = 1) = (1/21)^2 = 1/441$.
Furthermore, $S_3 = 3$ occurs precisely when all three persons choose a different floor.
Since there are $21 \cdot 20 \cdot 19$ ways to do this out of a total of $21^3$ possible ways, we
find that $\mathrm{P}(S_3 = 3) = 380/441$. Since $S_3$ can only attain the values 1, 2, 3, it follows
that $\mathrm{P}(S_3 = 2) = 1 - \mathrm{P}(S_3 = 1) - \mathrm{P}(S_3 = 3) = 60/441$.
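
For small $m$ the distribution of the number of stops can also be found by brute-force enumeration of all equally likely floor choices. The sketch below assumes, as in the solution, 21 floors chosen independently and uniformly; the function name `stops_distribution` is ours.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def stops_distribution(m, floors=21):
    """Exact distribution of the number of distinct floors chosen by m people."""
    # Each tuple is one equally likely outcome; len(set(...)) counts the stops.
    counts = Counter(len(set(choice)) for choice in product(range(floors), repeat=m))
    total = floors ** m
    return {stops: Fraction(c, total) for stops, c in sorted(counts.items())}


dist3 = stops_distribution(3)
for stops, prob in dist3.items():
    print(stops, prob)
```

Enumeration is only practical for small $m$, since the number of outcomes is $21^m$.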

4.13a Since we wait for the first time we draw the marked bolt in independent 
draws, each with a Ber(p) distribution, where p is the probability to draw the bolt 
(so p = 1/N), we find, using a reasoning as in Section 4.4, that X has a Geo(1/N) 
distribution. 


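4.13b Clearly, $\mathrm{P}(Y = 1) = 1/N$. Let $D_i$ be the event that the marked bolt was
drawn (for the first time) in the $i$th draw. For $k = 2, \ldots, N$ we have that

$$\mathrm{P}(Y = k) = \mathrm{P}(D_1^c \cap \cdots \cap D_{k-1}^c \cap D_k)
= \mathrm{P}(D_k \mid D_1^c \cap \cdots \cap D_{k-1}^c) \cdot \mathrm{P}(D_1^c \cap \cdots \cap D_{k-1}^c).$$

Now $\mathrm{P}(D_k \mid D_1^c \cap \cdots \cap D_{k-1}^c) = \frac{1}{N-k+1}$,

$$\mathrm{P}(D_1^c \cap \cdots \cap D_{k-1}^c) = \mathrm{P}(D_{k-1}^c \mid D_1^c \cap \cdots \cap D_{k-2}^c) \cdot \mathrm{P}(D_1^c \cap \cdots \cap D_{k-2}^c),$$

and

$$\mathrm{P}(D_{k-1}^c \mid D_1^c \cap \cdots \cap D_{k-2}^c) = 1 - \mathrm{P}(D_{k-1} \mid D_1^c \cap \cdots \cap D_{k-2}^c) = 1 - \frac{1}{N-k+2}.$$

Continuing in this way, we find after $k$ steps that

$$\mathrm{P}(Y = k) = \frac{1}{N-k+1} \cdot \frac{N-k+1}{N-k+2} \cdot \frac{N-k+2}{N-k+3} \cdots \frac{N-2}{N-1} \cdot \frac{N-1}{N} = \frac{1}{N}.$$

See also Section 9.3, where the distribution of $Y$ is derived in a different way.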


4.13c For $k = 0, 1, \ldots, r$, the probability $\mathrm{P}(Z = k)$ is equal to the number of ways
the event $\{Z = k\}$ can occur, divided by the number of ways $\binom{N}{r}$ we can select $r$
objects from $N$ objects; see also Section 4.3. Since one can select $k$ marked bolts
from $m$ marked ones in $\binom{m}{k}$ ways, and $r - k$ nonmarked bolts from $N - m$ nonmarked
ones in $\binom{N-m}{r-k}$ ways, it follows that

$$\mathrm{P}(Z = k) = \frac{\binom{m}{k}\binom{N-m}{r-k}}{\binom{N}{r}} \quad \text{for } k = 0, 1, 2, \ldots, r.$$

5.4a Let $T$ be the time until the next arrival of a bus. Then $T$ has a $U(4,6)$ distribution.
Hence $\mathrm{P}(X \le 4.5) = \mathrm{P}(T \le 4.5) = \int_4^{4.5} \frac{1}{2}\,dx = 1/4$.


5.4b Since Jensen leaves when the next bus arrives after more than 5 minutes,
$\mathrm{P}(X = 5) = \mathrm{P}(T > 5) = \int_5^6 \frac{1}{2}\,dx = 1/2$.


5.4c Since $\mathrm{P}(X = 5) = 0.5 > 0$, $X$ cannot be continuous. Since $X$ can take any of
the uncountable values in $[4,5]$, it can also not be discrete.


5.8a The probability density $g(y) = 1/(2\sqrt{ry})$ has an asymptote in 0 and decreases
to $1/(2r)$ in the point $r$. Outside $[0, r]$ the function is 0.


5.8b The second darter is better: for each $0 < b < r$ one has $(b/r)^2 < \sqrt{b/r}$, so the
second darter always has a larger probability to get closer to the center.


5.8c Any function F that is 0 left from 0, increasing on [0,7r], takes the value 0.9 
in r/10, and takes the value 1 in r and to the right of r is a correct answer to this 
question. 




5.13a This follows with a change of variable transformation $x \mapsto -x$ in the integral:
$\Phi(-a) = \int_{-\infty}^{-a} \phi(x)\,dx = \int_a^{\infty} \phi(-x)\,dx = \int_a^{\infty} \phi(x)\,dx = 1 - \Phi(a)$.

5.13b This is straightforward: $\mathrm{P}(Z \le -2) = \Phi(-2) = 1 - \Phi(2) = 0.0228$.

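6.5 We see that

$$X \le a \;\Leftrightarrow\; -\ln U \le a \;\Leftrightarrow\; \ln U \ge -a \;\Leftrightarrow\; U \ge e^{-a},$$

and so $\mathrm{P}(X \le a) = \mathrm{P}(U \ge e^{-a}) = 1 - \mathrm{P}(U < e^{-a}) = 1 - e^{-a}$, where we use
$\mathrm{P}(U < p) = p$ for $0 \le p \le 1$ applied to $p = e^{-a}$ (remember that $a \ge 0$).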
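
The identity $\mathrm{P}(-\ln U \le a) = 1 - e^{-a}$ can be illustrated by simulation. This is only a sketch: the seed, the sample size, and the choice $a = 1$ are arbitrary assumptions of ours, and we draw from $(0, 1]$ instead of $[0, 1)$ to avoid taking the logarithm of 0.

```python
import math
import random


def sample_exp1(rng):
    """Return -ln(U) for a uniform U, which should behave like an Exp(1) draw."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is always defined
    return -math.log(u)


rng = random.Random(12345)  # fixed seed, chosen arbitrarily for reproducibility
n = 100_000
a = 1.0

samples = [sample_exp1(rng) for _ in range(n)]
empirical = sum(x <= a for x in samples) / n  # fraction of draws with X <= a
exact = 1 - math.exp(-a)                      # value derived in 6.5
print(empirical, exact)
```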
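6.7 We need to obtain $F^{\mathrm{inv}}$, and do this by solving $F(x) = u$, for $0 \le u \le 1$:

$$1 - e^{-5x^2} = u \;\Leftrightarrow\; e^{-5x^2} = 1 - u \;\Leftrightarrow\; -5x^2 = \ln(1 - u)
\;\Leftrightarrow\; x^2 = -0.2\ln(1 - u) \;\Leftrightarrow\; x = \sqrt{-0.2\ln(1 - u)}.$$

The solution is $Z = \sqrt{-0.2\ln U}$ (replacing $1 - U$ by $U$, see Exercise 6.3). Note that
$Z^2$ has an $Exp(5)$ distribution.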


6.10a Define random variables $B_i = 1$ if $U_i < p$ and $B_i = 0$ if $U_i \ge p$. Then
$\mathrm{P}(B_i = 1) = p$ and $\mathrm{P}(B_i = 0) = 1 - p$: each $B_i$ has a $Ber(p)$ distribution. If
$B_1 = B_2 = \cdots = B_{k-1} = 0$ and $B_k = 1$, then $N = k$, i.e., $N$ is the position in the
sequence of Bernoulli random variables where the first 1 occurs. This is a $Geo(p)$
distribution. This can be verified by computing the probability mass function: for
$k \ge 1$,

$$\mathrm{P}(N = k) = \mathrm{P}(B_1 = B_2 = \cdots = B_{k-1} = 0, B_k = 1)
= \mathrm{P}(B_1 = 0)\,\mathrm{P}(B_2 = 0) \cdots \mathrm{P}(B_{k-1} = 0)\,\mathrm{P}(B_k = 1)
= (1 - p)^{k-1} p.$$
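
The construction in 6.10a translates directly into a simulation: keep drawing uniforms until one falls below $p$. The sketch below uses an arbitrary seed, $p = 0.3$, and sample size of our own choosing, and compares observed frequencies with $(1-p)^{k-1}p$.

```python
import random


def first_success(p, rng):
    """Position N of the first index k with U_k < p (i.e., B_k = 1)."""
    k = 1
    while rng.random() >= p:  # B_k = 0 when U_k >= p, so keep drawing
        k += 1
    return k


rng = random.Random(2024)  # arbitrary fixed seed
p = 0.3
n = 100_000
draws = [first_success(p, rng) for _ in range(n)]

for k in range(1, 5):
    observed = draws.count(k) / n
    expected = (1 - p) ** (k - 1) * p  # Geo(p) probability mass function
    print(k, round(observed, 4), round(expected, 4))

sample_mean = sum(draws) / n  # should be near 1/p for a Geo(p) distribution
```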


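6.10b If $Y$ is (a real number!) greater than $n$, then rounding upwards means we
obtain $n + 1$ or higher, so $\{Y > n\} = \{Z \ge n + 1\} = \{Z > n\}$. Therefore,
$\mathrm{P}(Z > n) = \mathrm{P}(Y > n) = e^{-\lambda n} = (e^{-\lambda})^n$. From $\lambda = -\ln(1 - p)$ we see: $e^{-\lambda} = 1 - p$,
so the last probability is $(1 - p)^n$. From $\mathrm{P}(Z > n - 1) = \mathrm{P}(Z = n) + \mathrm{P}(Z > n)$ we
find: $\mathrm{P}(Z = n) = \mathrm{P}(Z > n - 1) - \mathrm{P}(Z > n) = (1 - p)^{n-1} - (1 - p)^n = (1 - p)^{n-1} p$.
$Z$ has a $Geo(p)$ distribution.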


6.12 We need to generate stock prices for the next five years, or 60 months. So we
need sixty $U(0,1)$ random variables $U_1, \ldots, U_{60}$. Let $S_i$ denote the stock price in
month $i$, and set $S_0 = 100$, the initial stock price. From the $U_i$ we obtain the stock
movement, as follows, for $i = 1, 2, \ldots$:

$$S_i = \begin{cases}
0.95\,S_{i-1} & \text{if } U_i < 0.25, \\
S_{i-1} & \text{if } 0.25 \le U_i \le 0.75, \\
1.05\,S_{i-1} & \text{if } U_i > 0.75.
\end{cases}$$

We have carried this out, using the realizations below: 


1-10: 0.72 0.03 0.01 0.81 0.97 0.31 0.76 0.70 0.71 0.25 
11-20: 0.88 0.25 0.89 0.95 0.82 0.52 0.37 0.40 0.82 0.04 
21-30: 0.38 0.88 0.81 0.09 0.36 0.93 0.00 0.14 0.74 0.48 
31-40: 0.34 0.34 0.387 0.30 0.74 0.03 0.16 0.92 0.25 0.20 
41-50: 0.37 0.24 0.09 0.69 0.91 0.04 0.81 0.95 0.29 0.47 
51-60: 0.19 0.76 0.98 0.31 0.70 0.36 0.56 0.22 0.78 0.41 




We do not list all the stock prices, just the ones that matter for our investment
strategy (you can verify this). We first wait until the price drops below € 95, which
happens at $S_4 = 94.76$. Our money has been in the bank for four months, so we own
€ $1000 \cdot 1.005^4$ = € 1020.15, for which we can buy 1020.15/94.76 = 10.77 shares.
Next we wait until the price hits € 110; this happens at $S_{15} = 114.61$. We sell
our shares for € $10.77 \cdot 114.61$ = € 1233.85, and put the money in the bank. At
$S_{42} = 92.19$ we buy stock again, for the € $1233.85 \cdot 1.005^{27}$ = € 1411.71 that has
accrued in the bank. We can buy 15.31 shares. For the rest of the five-year period
nothing happens; the final price is $S_{60} = 100.63$, which puts the value of our portfolio
at € 1540.65.

For a real simulation the above should be repeated, say, one thousand times. The 
one thousand net results then give us an impression of the probability distribution 
that corresponds to this model and strategy. 


7.6 Since $f$ is increasing on the interval $[2,3]$ we know from the interpretation of
expectation as center of gravity that the expectation should lie closer to 3 than to 2.
The computation: $\mathrm{E}[Z] = \int_2^3 \frac{3}{19} z^3\,dz = \left[\frac{3}{76} z^4\right]_2^3 = 2\frac{43}{76}$.
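7.15a We use the change-of-units rule for the expectation twice:

$$\mathrm{Var}(rX) = \mathrm{E}[(rX - \mathrm{E}[rX])^2] = \mathrm{E}[(rX - r\mathrm{E}[X])^2]
= \mathrm{E}[r^2(X - \mathrm{E}[X])^2] = r^2\mathrm{E}[(X - \mathrm{E}[X])^2] = r^2\mathrm{Var}(X).$$

7.15b Now we use the change-of-units rule for the expectation once:

$$\mathrm{Var}(X + s) = \mathrm{E}[((X + s) - \mathrm{E}[X + s])^2]
= \mathrm{E}[((X + s) - \mathrm{E}[X] - s)^2] = \mathrm{E}[(X - \mathrm{E}[X])^2] = \mathrm{Var}(X).$$

7.15c With first b, and then a: $\mathrm{Var}(rX + s) = \mathrm{Var}(rX) = r^2\mathrm{Var}(X)$.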


7.17a Since $a_i \ge 0$ and $p_i \ge 0$ it must follow that $a_1p_1 + \cdots + a_rp_r \ge 0$. So
$0 = \mathrm{E}[U] = a_1p_1 + \cdots + a_rp_r \ge 0$. As we may assume that all $p_i > 0$, it follows that
$a_1 = a_2 = \cdots = a_r = 0$.
7.17b Let $m = \mathrm{E}[V] = p_1b_1 + \cdots + p_rb_r$. Then the random variable $U = (V - \mathrm{E}[V])^2$
takes the values $a_1 = (b_1 - m)^2, \ldots, a_r = (b_r - m)^2$. Since $\mathrm{E}[U] = \mathrm{Var}(V) = 0$, part
a tells us that $0 = a_1 = (b_1 - m)^2, \ldots, 0 = a_r = (b_r - m)^2$. But this is only
possible if $b_1 = m, \ldots, b_r = m$. Since $m = \mathrm{E}[V]$, this is the same as saying that
$\mathrm{P}(V = \mathrm{E}[V]) = 1$.
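8.2a First we determine the possible values that $Y$ can take. Here these are $-1$, 0,
and 1. Then we investigate which $x$-values lead to these $y$-values and sum the
probabilities of the $x$-values to obtain the probability of the $y$-value. For instance,

$$\mathrm{P}(Y = 0) = \mathrm{P}(X = 2) + \mathrm{P}(X = 4) + \mathrm{P}(X = 6) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}.$$

Similarly, we obtain for the two other values

$$\mathrm{P}(Y = -1) = \mathrm{P}(X = 3) = \frac{1}{6}, \qquad \mathrm{P}(Y = 1) = \mathrm{P}(X = 1) + \mathrm{P}(X = 5) = \frac{1}{3}.$$

8.2b The values taken by $Z$ are $-1$, 0, and 1. Furthermore

$$\mathrm{P}(Z = 0) = \mathrm{P}(X = 1) + \mathrm{P}(X = 3) + \mathrm{P}(X = 5) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2},$$

and similarly $\mathrm{P}(Z = -1) = 1/3$ and $\mathrm{P}(Z = 1) = 1/6$.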




8.2c Since for any $\alpha$ one has $\sin^2(\alpha) + \cos^2(\alpha) = 1$, $W$ can only take the value 1,
so $\mathrm{P}(W = 1) = 1$.


8.10 Because of symmetry: $\mathrm{P}(X \ge 3) = 0.500$. Furthermore: $\sigma^2 = 4$, so $\sigma = 2$.
Then $Z = (X - 3)/2$ is an $N(0,1)$ distributed random variable, so that $\mathrm{P}(X \le 1) =
\mathrm{P}((X - 3)/2 \le (1 - 3)/2) = \mathrm{P}(Z \le -1) = \mathrm{P}(Z \ge 1) = 0.1587$.

8.11 Since $-g$ is a convex function, Jensen’s inequality yields that $-g(\mathrm{E}[X]) \le
\mathrm{E}[-g(X)]$. Since $\mathrm{E}[-g(X)] = -\mathrm{E}[g(X)]$, the inequality follows by multiplying both
sides by $-1$.

8.12a The possible values $Y$ can take are $\sqrt{0} = 0$, $\sqrt{1} = 1$, $\sqrt{100} = 10$, and
$\sqrt{10000} = 100$. Hence the probability mass function is given by

   y            0      1      10     100
   P(Y = y)     1/4    1/4    1/4    1/4


8.12b Compute the second derivative: $\frac{d^2}{dx^2}\sqrt{x} = -\frac{1}{4}x^{-3/2} < 0$. Hence $g(x) = -\sqrt{x}$
is a convex function. Jensen’s inequality yields that $\sqrt{\mathrm{E}[X]} \ge \mathrm{E}[\sqrt{X}]$.


8.12c We obtain $\sqrt{\mathrm{E}[X]} = \sqrt{(0 + 1 + 100 + 10000)/4} = 50.25$, but

$$\mathrm{E}[\sqrt{X}] = \mathrm{E}[Y] = (0 + 1 + 10 + 100)/4 = 27.75.$$
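
The two numbers in 8.12c are quick to recompute. In this small sketch the four values of $X$ and their equal probabilities are taken from the solution above; the variable names are ours.

```python
import math

values = [0, 1, 100, 10000]  # values of X, each with probability 1/4

mean_x = sum(values) / len(values)
sqrt_of_mean = math.sqrt(mean_x)                                 # sqrt(E[X])
mean_of_sqrt = sum(math.sqrt(v) for v in values) / len(values)   # E[sqrt(X)]

print(round(sqrt_of_mean, 2), mean_of_sqrt)
```

The gap between the two values illustrates how far Jensen's inequality can be from equality when the distribution is very spread out.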


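8.19a This happens for all $\varphi$ in the interval $[\pi/4, \pi/2]$, which corresponds to the
upper right quarter of the circle.


8.19b Since $\{Z \le t\} = \{X \le \arctan(t)\}$, we obtain

$$F_Z(t) = \mathrm{P}(Z \le t) = \mathrm{P}(X \le \arctan(t)) = \frac{1}{2} + \frac{1}{\pi}\arctan(t).$$

8.19c Differentiating $F_Z$ we obtain that the probability density function of $Z$ is

$$f_Z(z) = \frac{d}{dz}F_Z(z) = \frac{d}{dz}\left(\frac{1}{2} + \frac{1}{\pi}\arctan(z)\right) = \frac{1}{\pi(1 + z^2)} \quad \text{for } -\infty < z < \infty.$$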
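9.2a From $\mathrm{P}(X = 1, Y = 1) = 1/2$, $\mathrm{P}(X = 1) = 2/3$, and the fact that $\mathrm{P}(X = 1) =
\mathrm{P}(X = 1, Y = 1) + \mathrm{P}(X = 1, Y = -1)$, it follows that $\mathrm{P}(X = 1, Y = -1) = 1/6$.
Since $\mathrm{P}(Y = 1) = 1/2$ and $\mathrm{P}(X = 1, Y = 1) = 1/2$, we must have: $\mathrm{P}(X = 0, Y = 1)$
and $\mathrm{P}(X = 2, Y = 1)$ are both zero. From this and the fact that $\mathrm{P}(X = 0) = 1/6 =
\mathrm{P}(X = 2)$ one finds that $\mathrm{P}(X = 0, Y = -1) = 1/6 = \mathrm{P}(X = 2, Y = -1)$.


9.2b Since, e.g., $\mathrm{P}(X = 2, Y = 1) = 0$ is different from $\mathrm{P}(X = 2)\,\mathrm{P}(Y = 1) = \frac{1}{6} \cdot \frac{1}{2} = \frac{1}{12}$,
one finds that $X$ and $Y$ are dependent.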


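9.8a Since $X$ can attain the values 0 and 1 and $Y$ the values 0 and 2, $Z$ can attain
the values 0, 1, 2, and 3 with probabilities: $\mathrm{P}(Z = 0) = \mathrm{P}(X = 0, Y = 0) = 1/4$,
$\mathrm{P}(Z = 1) = \mathrm{P}(X = 1, Y = 0) = 1/4$, $\mathrm{P}(Z = 2) = \mathrm{P}(X = 0, Y = 2) = 1/4$, and
$\mathrm{P}(Z = 3) = \mathrm{P}(X = 1, Y = 2) = 1/4$.

9.8b Since $\tilde{X} = \tilde{Z} - \tilde{Y}$, $\tilde{X}$ can attain the values $-2$, $-1$, 0, 1, 2, and 3 with
probabilities

$\mathrm{P}(\tilde{X} = -2) = \mathrm{P}(\tilde{Z} = 0, \tilde{Y} = 2) = 1/8$,
$\mathrm{P}(\tilde{X} = -1) = \mathrm{P}(\tilde{Z} = 1, \tilde{Y} = 2) = 1/8$,
$\mathrm{P}(\tilde{X} = 0) = \mathrm{P}(\tilde{Z} = 0, \tilde{Y} = 0) + \mathrm{P}(\tilde{Z} = 2, \tilde{Y} = 2) = 1/4$,
$\mathrm{P}(\tilde{X} = 1) = \mathrm{P}(\tilde{Z} = 1, \tilde{Y} = 0) + \mathrm{P}(\tilde{Z} = 3, \tilde{Y} = 2) = 1/4$,
$\mathrm{P}(\tilde{X} = 2) = \mathrm{P}(\tilde{Z} = 2, \tilde{Y} = 0) = 1/8$,
$\mathrm{P}(\tilde{X} = 3) = \mathrm{P}(\tilde{Z} = 3, \tilde{Y} = 0) = 1/8$.


We have the following table:

   x         -2     -1     0      1      2      3
   p(x)      1/8    1/8    1/4    1/4    1/8    1/8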
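9.9a One has that $F_X(x) = \lim_{y \to \infty} F(x, y)$. So for $x \le 0$: $F_X(x) = 0$, and for
$x > 0$: $F_X(x) = F(x, \infty) = 1 - e^{-2x}$. Similarly, $F_Y(y) = 0$ for $y \le 0$, and for $y > 0$:
$F_Y(y) = F(\infty, y) = 1 - e^{-y}$.


9.9b For $x > 0$ and $y > 0$:

$$f(x, y) = \frac{\partial^2}{\partial x\,\partial y} F(x, y) = \frac{\partial}{\partial x}\left(e^{-y} - e^{-(2x+y)}\right) = 2e^{-(2x+y)}.$$

9.9c There are two ways to determine $f_X(x)$:

$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy = \int_0^{\infty} 2e^{-(2x+y)}\,dy = 2e^{-2x} \quad \text{for } x > 0$$

and

$$f_X(x) = \frac{d}{dx} F_X(x) = 2e^{-2x} \quad \text{for } x > 0.$$

Using either way one finds that $f_Y(y) = e^{-y}$ for $y > 0$.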


9.9d Since F(x,y) = Fx(x)Fy(y) for all x,y, we find that X and Y are indepen- 
dent. 


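9.11 To determine $\mathrm{P}(X < Y)$ we must integrate $f(x, y)$ over the region $G$ of points
$(x, y)$ in $\mathbb{R}^2$ for which $x$ is smaller than $y$:

$$\mathrm{P}(X < Y) = \iint_{\{(x,y) \in \mathbb{R}^2;\; x < y\}} f(x, y)\,dx\,dy
= \int_0^1 \left(\int_0^y f(x, y)\,dx\right) dy = \int_0^1 \left(\int_0^y \frac{12}{5}\,xy(1 + y)\,dx\right) dy$$
$$= \frac{12}{5}\int_0^1 y(1 + y)\left(\int_0^y x\,dx\right) dy = \frac{12}{10}\int_0^1 y^3(1 + y)\,dy = \frac{27}{50}.$$

Here we used that $f(x, y) = 0$ for $(x, y)$ outside the unit square.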


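9.15a Setting $\square(a, b)$ as the set of points $(x, y)$ for which $x \le a$ and $y \le b$, we
have that

$$F(a, b) = \frac{\text{area}(\Delta \cap \square(a, b))}{\text{area of } \Delta}.$$

• If $a < 0$ or if $b < 0$ (or both), then $\text{area}(\Delta \cap \square(a, b)) = 0$, so $F(a, b) = 0$,
• If $(a, b) \in \Delta$, then $\text{area}(\Delta \cap \square(a, b)) = a(b - \frac{1}{2}a)$, so $F(a, b) = a(2b - a)$,
• If $0 \le b \le 1$ and $a > b$, then $\text{area}(\Delta \cap \square(a, b)) = \frac{1}{2}b^2$, so $F(a, b) = b^2$,
• If $0 \le a \le 1$ and $b > 1$, then $\text{area}(\Delta \cap \square(a, b)) = a - \frac{1}{2}a^2$, so $F(a, b) = 2a - a^2$,
• If both $a > 1$ and $b > 1$, then $\text{area}(\Delta \cap \square(a, b)) = \frac{1}{2}$, so $F(a, b) = 1$.


9.15b Since $f(x, y) = \frac{\partial^2}{\partial x\,\partial y} F(x, y)$, we find for $(x, y) \in \Delta$ that $f(x, y) = 2$.
Furthermore, $f(x, y) = 0$ for $(x, y)$ outside the triangle $\Delta$.
9.15c For $x$ between 0 and 1,

$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy = \int_x^1 2\,dy = 2(1 - x).$$

For $y$ between 0 and 1,

$$f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx = \int_0^y 2\,dx = 2y.$$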


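10.6a When $c = 0$, the joint distribution becomes

                         a
   b            -1       0        1       P(Y = b)
   -1           2/45     9/45     4/45    1/3
    0           7/45     5/45     3/45    1/3
    1           6/45     1/45     8/45    1/3
   P(X = a)     1/3      1/3      1/3     1


We find $\mathrm{E}[X] = (-1) \cdot \frac{1}{3} + 0 \cdot \frac{1}{3} + 1 \cdot \frac{1}{3} = 0$, and similarly $\mathrm{E}[Y] = 0$. By leaving out
terms where either $X = 0$ or $Y = 0$, we find

$$\mathrm{E}[XY] = (-1) \cdot (-1) \cdot \frac{2}{45} + (-1) \cdot 1 \cdot \frac{4}{45} + 1 \cdot (-1) \cdot \frac{6}{45} + 1 \cdot 1 \cdot \frac{8}{45} = 0,$$

which implies that $\mathrm{Cov}(X, Y) = \mathrm{E}[XY] - \mathrm{E}[X]\,\mathrm{E}[Y] = 0$.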

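10.6b Note that the variables $X$ and $Y$ in part b are equal to the ones from part a,
shifted by $c$. If we write $U$ and $V$ for the variables from a, then $X = U + c$ and
$Y = V + c$. According to the rule on the covariance under change of units, we then
immediately find $\mathrm{Cov}(X, Y) = \mathrm{Cov}(U + c, V + c) = \mathrm{Cov}(U, V) = 0$.

Alternatively, one could also compute the covariance from $\mathrm{Cov}(X, Y) = \mathrm{E}[XY] -
\mathrm{E}[X]\,\mathrm{E}[Y]$. We find $\mathrm{E}[X] = (c - 1) \cdot \frac{1}{3} + c \cdot \frac{1}{3} + (c + 1) \cdot \frac{1}{3} = c$, and similarly $\mathrm{E}[Y] = c$.
Since

$$\mathrm{E}[XY] = (c-1)(c-1)\tfrac{2}{45} + (c-1)\,c\,\tfrac{9}{45} + (c-1)(c+1)\tfrac{4}{45}$$
$$\quad + c\,(c-1)\tfrac{7}{45} + c \cdot c \cdot \tfrac{5}{45} + c\,(c+1)\tfrac{3}{45}$$
$$\quad + (c+1)(c-1)\tfrac{6}{45} + (c+1)\,c\,\tfrac{1}{45} + (c+1)(c+1)\tfrac{8}{45} = c^2,$$

we find $\mathrm{Cov}(X, Y) = \mathrm{E}[XY] - \mathrm{E}[X]\,\mathrm{E}[Y] = c^2 - c \cdot c = 0$.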




10.6c No, X and Y are not independent. For instance, P(X = c, ¥Y =c+1) = 1/45, 
which differs from P(X =c) P(Y =c+1)=1/9. 
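
The three parts of 10.6 (zero covariance for $c = 0$, zero covariance after shifting by $c$, and dependence) can be checked from the joint table with exact arithmetic. In this sketch the dictionary layout, keyed by the unshifted pair $(a, b)$, and the function names are our own choices.

```python
from fractions import Fraction

# Joint probabilities for c = 0, keyed by (a, b) = (value of X, value of Y), in units of 1/45.
table = {
    (-1, -1): 2, (0, -1): 9, (1, -1): 4,
    (-1, 0): 7, (0, 0): 5, (1, 0): 3,
    (-1, 1): 6, (0, 1): 1, (1, 1): 8,
}
joint = {ab: Fraction(w, 45) for ab, w in table.items()}


def covariance(joint, c=0):
    """Cov(X, Y) when both variables are shifted by the constant c."""
    e_x = sum((a + c) * p for (a, b), p in joint.items())
    e_y = sum((b + c) * p for (a, b), p in joint.items())
    e_xy = sum((a + c) * (b + c) * p for (a, b), p in joint.items())
    return e_xy - e_x * e_y


def marginal_x(joint, a):
    return sum(p for (x, y), p in joint.items() if x == a)


def marginal_y(joint, b):
    return sum(p for (x, y), p in joint.items() if y == b)


print(covariance(joint), covariance(joint, c=5))
# Dependence: compare P(X = c, Y = c + 1) with the product of the marginals.
print(joint[(0, 1)], marginal_x(joint, 0) * marginal_y(joint, 1))
```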


10.9a If the aggregated blood sample tests negative, we do not have to perform
additional tests, so that $X_i$ takes on the value 1. If the aggregated blood sample
tests positive, we have to perform 40 additional tests for the blood sample of each
person in the group, so that $X_i$ takes on the value 41. We first find that $\mathrm{P}(X_i = 1) =
\mathrm{P}(\text{no infections in group of 40}) = (1 - 0.001)^{40} = 0.96$, and therefore $\mathrm{P}(X_i = 41) =
1 - \mathrm{P}(X_i = 1) = 0.04$.

10.9b First compute $\mathrm{E}[X_i] = 1 \cdot 0.96 + 41 \cdot 0.04 = 2.6$. The expected total number of
tests is $\mathrm{E}[X_1 + X_2 + \cdots + X_{25}] = \mathrm{E}[X_1] + \mathrm{E}[X_2] + \cdots + \mathrm{E}[X_{25}] = 25 \cdot 2.6 = 65$. With
the original procedure of blood testing, the total number of tests is $25 \cdot 40 = 1000$. On
average the alternative procedure would only require 65 tests. Only with very small
probability would one end up doing more than 1000 tests, so the alternative
procedure is better.
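
The expected number of tests can be recomputed without rounding the intermediate probability. The sketch below uses the numbers from 10.9 (infection probability 0.001, groups of 40, 25 groups); because the solution rounds $\mathrm{P}(X_i = 1)$ to 0.96, the unrounded total comes out slightly below 65.

```python
p_infected = 0.001
group_size = 40
n_groups = 25

p_negative = (1 - p_infected) ** group_size  # P(X_i = 1): no infections in the group
# One pooled test always; group_size extra tests if the pooled sample is positive.
expected_per_group = 1 * p_negative + (1 + group_size) * (1 - p_negative)
expected_total = n_groups * expected_per_group

print(round(p_negative, 4), round(expected_per_group, 3), round(expected_total, 1))
```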


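10.10a We find

$$\mathrm{E}[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx = \int_0^3 \frac{2}{225}(9x^3 + 7x^2)\,dx = \frac{2}{225}\left[\frac{9}{4}x^4 + \frac{7}{3}x^3\right]_0^3 = \frac{109}{50},$$
$$\mathrm{E}[Y] = \int_{-\infty}^{\infty} y f_Y(y)\,dy = \int_1^2 \frac{1}{25}(3y^3 + 12y^2)\,dy = \frac{1}{25}\left[\frac{3}{4}y^4 + 4y^3\right]_1^2 = \frac{157}{100},$$

so that $\mathrm{E}[X + Y] = \mathrm{E}[X] + \mathrm{E}[Y] = 15/4$.
10.10b We find

$$\mathrm{E}[X^2] = \int_{-\infty}^{\infty} x^2 f_X(x)\,dx = \int_0^3 \frac{2}{225}(9x^4 + 7x^3)\,dx = \frac{2}{225}\left[\frac{9}{5}x^5 + \frac{7}{4}x^4\right]_0^3 = \frac{1287}{250},$$
$$\mathrm{E}[Y^2] = \int_{-\infty}^{\infty} y^2 f_Y(y)\,dy = \int_1^2 \frac{1}{25}(3y^4 + 12y^3)\,dy = \frac{1}{25}\left[\frac{3}{5}y^5 + 3y^4\right]_1^2 = \frac{318}{125},$$
$$\mathrm{E}[XY] = \int_0^3\!\int_1^2 xy f(x, y)\,dy\,dx = \int_0^3\!\int_1^2 \frac{2}{75}(2x^3y^2 + x^2y^3)\,dy\,dx$$
$$= \frac{4}{75}\int_0^3 x^3\left(\int_1^2 y^2\,dy\right)dx + \frac{2}{75}\int_0^3 x^2\left(\int_1^2 y^3\,dy\right)dx
= \frac{4}{75} \cdot \frac{7}{3}\int_0^3 x^3\,dx + \frac{2}{75} \cdot \frac{15}{4}\int_0^3 x^2\,dx = \frac{171}{50},$$

so that $\mathrm{E}[(X + Y)^2] = \mathrm{E}[X^2] + \mathrm{E}[Y^2] + 2\mathrm{E}[XY] = 3633/250$.
10.10c We find

$$\mathrm{Var}(X) = \mathrm{E}[X^2] - (\mathrm{E}[X])^2 = \frac{1287}{250} - \left(\frac{109}{50}\right)^2 = \frac{989}{2500},$$
$$\mathrm{Var}(Y) = \mathrm{E}[Y^2] - (\mathrm{E}[Y])^2 = \frac{318}{125} - \left(\frac{157}{100}\right)^2 = \frac{791}{10000},$$
$$\mathrm{Var}(X + Y) = \mathrm{E}[(X + Y)^2] - (\mathrm{E}[X + Y])^2 = \frac{3633}{250} - \left(\frac{15}{4}\right)^2 = \frac{939}{2000}.$$

Hence, $\mathrm{Var}(X) + \mathrm{Var}(Y) = 0.4747$, which differs from $\mathrm{Var}(X + Y) = 0.4695$.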
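
Since all integrands in 10.10 are polynomials, the moments can be verified with exact rational arithmetic. The sketch below assumes the densities that can be read off from the integrands in the solution, namely $f_X(x) = \frac{2}{225}(9x^2 + 7x)$ on $[0,3]$, $f_Y(y) = \frac{1}{25}(3y^2 + 12y)$ on $[1,2]$, and $f(x,y) = \frac{2}{75}(2x^2y + xy^2)$; the helper names are ours.

```python
from fractions import Fraction as Fr


def integral(terms, lo, hi):
    """Integrate sum(c * t**k for (c, k) in terms) over [lo, hi], exactly."""
    return sum(Fr(c) * (Fr(hi) ** (k + 1) - Fr(lo) ** (k + 1)) / (k + 1) for c, k in terms)


def moment_x(r):
    """E[X^r] for the assumed marginal f_X(x) = (2/225)(9x^2 + 7x) on [0, 3]."""
    return Fr(2, 225) * integral([(9, 2 + r), (7, 1 + r)], 0, 3)


def moment_y(r):
    """E[Y^r] for the assumed marginal f_Y(y) = (1/25)(3y^2 + 12y) on [1, 2]."""
    return Fr(1, 25) * integral([(3, 2 + r), (12, 1 + r)], 1, 2)


# E[XY]: x*y*f(x,y) = (2/75)(2 x^3 y^2 + x^2 y^3), and each term factorizes.
e_xy = Fr(2, 75) * (
    2 * integral([(1, 3)], 0, 3) * integral([(1, 2)], 1, 2)
    + integral([(1, 2)], 0, 3) * integral([(1, 3)], 1, 2)
)

var_x = moment_x(2) - moment_x(1) ** 2
var_y = moment_y(2) - moment_y(1) ** 2
cov_xy = e_xy - moment_x(1) * moment_y(1)
var_sum = var_x + var_y + 2 * cov_xy  # Var(X + Y)

print(var_x, var_y, cov_xy, var_sum)
```

The small negative covariance is what makes $\mathrm{Var}(X + Y)$ slightly smaller than $\mathrm{Var}(X) + \mathrm{Var}(Y)$.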




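10.14a By using the alternative expression for the covariance and linearity of
expectations, we find

$\mathrm{Cov}(X + s, Y + u)$
$= \mathrm{E}[(X + s)(Y + u)] - \mathrm{E}[X + s]\,\mathrm{E}[Y + u]$
$= \mathrm{E}[XY + sY + uX + su] - (\mathrm{E}[X] + s)(\mathrm{E}[Y] + u)$
$= (\mathrm{E}[XY] + s\mathrm{E}[Y] + u\mathrm{E}[X] + su) - (\mathrm{E}[X]\,\mathrm{E}[Y] + s\mathrm{E}[Y] + u\mathrm{E}[X] + su)$
$= \mathrm{E}[XY] - \mathrm{E}[X]\,\mathrm{E}[Y]$
$= \mathrm{Cov}(X, Y)$.


10.14b By using the alternative expression for the covariance and the rule on
expectations under change of units, we find

$\mathrm{Cov}(rX, tY) = \mathrm{E}[(rX)(tY)] - \mathrm{E}[rX]\,\mathrm{E}[tY]$
$= \mathrm{E}[rtXY] - (r\mathrm{E}[X])(t\mathrm{E}[Y])$
$= rt\,\mathrm{E}[XY] - rt\,\mathrm{E}[X]\,\mathrm{E}[Y]$
$= rt\,(\mathrm{E}[XY] - \mathrm{E}[X]\,\mathrm{E}[Y])$
$= rt\,\mathrm{Cov}(X, Y)$.


10.14c First applying part a and then part b yields
$\mathrm{Cov}(rX + s, tY + u) = \mathrm{Cov}(rX, tY) = rt\,\mathrm{Cov}(X, Y)$.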


10.18 First note that $X_1 + X_2 + \cdots + X_N$ is the sum of all numbers, which is
a nonrandom constant. Therefore, $\mathrm{Var}(X_1 + X_2 + \cdots + X_N) = 0$. In Section 9.3
we argued that, although we draw without replacement, each $X_i$ has the same
distribution. By the same reasoning, we find that each pair $(X_i, X_j)$, with $i \neq j$,
has the same joint distribution, so that $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_1, X_2)$ for all pairs
with $i \neq j$. Direct application of Exercise 10.17 with $\sigma^2 = (N - 1)(N + 1)/12$ and
$\gamma = \mathrm{Cov}(X_1, X_2)$ gives

$$0 = \mathrm{Var}(X_1 + X_2 + \cdots + X_N) = N \cdot \frac{(N - 1)(N + 1)}{12} + N(N - 1)\,\mathrm{Cov}(X_1, X_2).$$

Solving this identity gives $\mathrm{Cov}(X_1, X_2) = -(N + 1)/12$.


11.2a By using the rule on addition of two independent discrete random variables,
we have

$$\mathrm{P}(X + Y = k) = \sum_{\ell=0}^{\infty} p_X(k - \ell)\,p_Y(\ell).$$

Because $p_X(a) = 0$ for $a \le -1$, all terms with $\ell \ge k + 1$ vanish, so that

$$\mathrm{P}(X + Y = k) = \sum_{\ell=0}^{k} \frac{1^{k-\ell}}{(k - \ell)!}e^{-1} \cdot \frac{1^{\ell}}{\ell!}e^{-1} = \frac{e^{-2}}{k!}\sum_{\ell=0}^{k}\binom{k}{\ell} = \frac{2^k}{k!}e^{-2},$$

also using $\sum_{\ell=0}^{k}\binom{k}{\ell} = 2^k$ in the last equality.




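11.2b Similar to part a, by using the rule on addition of two independent discrete
random variables and leaving out terms for which $p_X(a) = 0$, we have

$$\mathrm{P}(X + Y = k) = \sum_{\ell=0}^{k} \frac{\lambda^{k-\ell}}{(k - \ell)!}e^{-\lambda} \cdot \frac{\mu^{\ell}}{\ell!}e^{-\mu} = \frac{e^{-(\lambda+\mu)}}{k!}\sum_{\ell=0}^{k}\binom{k}{\ell}\lambda^{k-\ell}\mu^{\ell}.$$

Next, write

$$\frac{\lambda^{k-\ell}\mu^{\ell}}{(\lambda + \mu)^k} = \left(\frac{\mu}{\lambda + \mu}\right)^{\ell}\left(\frac{\lambda}{\lambda + \mu}\right)^{k-\ell} = \left(\frac{\mu}{\lambda + \mu}\right)^{\ell}\left(1 - \frac{\mu}{\lambda + \mu}\right)^{k-\ell} = p^{\ell}(1 - p)^{k-\ell}$$

with $p = \mu/(\lambda + \mu)$. This means that

$$\mathrm{P}(X + Y = k) = \frac{(\lambda + \mu)^k}{k!}e^{-(\lambda+\mu)}\sum_{\ell=0}^{k}\binom{k}{\ell}p^{\ell}(1 - p)^{k-\ell} = \frac{(\lambda + \mu)^k}{k!}e^{-(\lambda+\mu)},$$

using that $\sum_{\ell=0}^{k}\binom{k}{\ell}p^{\ell}(1 - p)^{k-\ell} = 1$.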
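
The conclusion of 11.2b, that the sum of independent $Pois(\lambda)$ and $Pois(\mu)$ variables is $Pois(\lambda + \mu)$ distributed, can be checked numerically by evaluating the convolution sum. In this sketch the parameter values 1.5 and 2.0 are arbitrary choices of ours.

```python
import math


def pois_pmf(k, rate):
    """Probability mass function of a Pois(rate) distribution at k."""
    return rate ** k * math.exp(-rate) / math.factorial(k)


def sum_pmf(k, lam, mu):
    """P(X + Y = k) via the convolution sum over l = 0, ..., k."""
    return sum(pois_pmf(k - l, lam) * pois_pmf(l, mu) for l in range(k + 1))


lam, mu = 1.5, 2.0  # arbitrary illustrative parameters
max_gap = max(abs(sum_pmf(k, lam, mu) - pois_pmf(k, lam + mu)) for k in range(15))
print(max_gap)  # difference is at the level of floating-point rounding
```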

11.4a From the fact that $X$ has an $N(2,5)$ distribution, it follows that $\mathrm{E}[X] = 2$
and $\mathrm{Var}(X) = 5$. Similarly, $\mathrm{E}[Y] = 5$ and $\mathrm{Var}(Y) = 9$. Hence by linearity of
expectations,

$$\mathrm{E}[Z] = \mathrm{E}[3X - 2Y + 1] = 3\mathrm{E}[X] - 2\mathrm{E}[Y] + 1 = 3 \cdot 2 - 2 \cdot 5 + 1 = -3.$$

By the rules for the variance and covariance,

$$\mathrm{Var}(Z) = 9\mathrm{Var}(X) + 4\mathrm{Var}(Y) - 12\mathrm{Cov}(X, Y) = 9 \cdot 5 + 4 \cdot 9 - 12 \cdot 0 = 81,$$

using that $\mathrm{Cov}(X, Y) = 0$, due to independence of $X$ and $Y$.


11.4b The random variables 3X and —2Y +1 are independent and, according to 
the rule for the normal distribution under a change of units (page 106), it follows 
that they both have a normal distribution. Next, the sum rule for independent 
normal random variables then yields that Z = (3X) + (—2Y +1) also has a normal 
distribution. Its parameters are the expectation and variance of Z. From a it follows 
that Z has an N(—3, 81) distribution. 


11.4c From b we know that $Z$ has an $N(-3, 81)$ distribution, so that $(Z + 3)/9$
has a standard normal distribution. Therefore

$$\mathrm{P}(Z \le 6) = \mathrm{P}\left(\frac{Z + 3}{9} \le \frac{6 + 3}{9}\right) = \Phi(1),$$

where $\Phi$ is the standard normal distribution function. From Table B.1 we find that
$\Phi(1) = 1 - 0.1587 = 0.8413$.
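
Instead of a table lookup, $\Phi$ can be evaluated through the error function, using $\Phi(x) = \frac{1}{2}(1 + \mathrm{erf}(x/\sqrt{2}))$. The sketch below recomputes the parameters of $Z = 3X - 2Y + 1$ from 11.4a and then the probability from 11.4c; the function name `std_normal_cdf` is ours.

```python
import math


def std_normal_cdf(x):
    """Standard normal distribution function, expressed through math.erf."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


# Z = 3X - 2Y + 1 with X ~ N(2, 5) and Y ~ N(5, 9) independent.
mean_z = 3 * 2 - 2 * 5 + 1  # linearity of expectations
var_z = 9 * 5 + 4 * 9       # variances add, with squared coefficients

p = std_normal_cdf((6 - mean_z) / math.sqrt(var_z))  # P(Z <= 6)
print(mean_z, var_z, round(p, 4))
```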


11.9a According to the product rule on page 160, 


[ fy (2) fx(o)zae = [ arr 


fz(z) 




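11.9b According to the product rule,

$$f_Z(z) = \int_1^z f_Y\!\left(\frac{z}{x}\right) f_X(x)\,\frac{1}{x}\,dx = \int_1^z \frac{\beta}{(z/x)^{\beta+1}} \cdot \frac{\alpha}{x^{\alpha+1}} \cdot \frac{1}{x}\,dx$$
$$= \frac{\alpha\beta}{z^{\beta+1}}\int_1^z x^{\beta-\alpha-1}\,dx = \frac{\alpha\beta}{z^{\beta+1}}\left[\frac{x^{\beta-\alpha}}{\beta - \alpha}\right]_1^z = \frac{\alpha\beta}{\alpha - \beta} \cdot \frac{1}{z^{\beta+1}}\left(1 - z^{\beta-\alpha}\right).$$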


12.1e This is certainly open to discussion. Bankruptcies: no (they come in clusters, 
don’t they?). Eggs: no (I suppose after one egg it takes the chicken some time to 
produce another). Examples 3 and 4 are the best candidates. Example 5 could be 
modeled by the Poisson process if the crossing is not a dangerous one; otherwise 
authorities might take measures and destroy the homogeneity. 


12.6 The expected number of flaws in 1 meter is 100/40 = 2.5, and hence
the number of flaws $X$ has a $Pois(2.5)$ distribution. The answer is $\mathrm{P}(X = 2) =
\frac{1}{2!}(2.5)^2 e^{-2.5} = 0.256$.

12.7a It is reasonable to estimate $\lambda$ with (nr. of cars)/(total time in sec.) = 0.192.
12.7b 19/120 = 0.1583, and if $\lambda = 0.192$ then $\mathrm{P}(N(10) = 0) = e^{-0.192 \cdot 10} = 0.147$.

12.7c $\mathrm{P}(N(10) = 10)$ with $\lambda$ from a seems a reasonable approximation of this
probability. It equals $e^{-1.92} \cdot (0.192 \cdot 10)^{10}/10! = 2.71 \cdot 10^{-5}$.

12.11 Following the hint, we obtain: 


\[
\begin{aligned}
P(N([0,s]) = k,\ N([0,2s]) = n) &= P(N([0,s]) = k,\ N((s,2s]) = n-k) \\
 &= P(N([0,s]) = k) \cdot P(N((s,2s]) = n-k) \\
 &= (\lambda s)^k e^{-\lambda s}/(k!) \cdot (\lambda s)^{n-k} e^{-\lambda s}/((n-k)!) \\
 &= (\lambda s)^n e^{-2\lambda s}/(k!\,(n-k)!).
\end{aligned}
\]

So

\[
\begin{aligned}
P(N([0,s]) = k \mid N([0,2s]) = n) &= \frac{P(N([0,s]) = k,\ N([0,2s]) = n)}{P(N([0,2s]) = n)} \\
 &= n!/(k!\,(n-k)!) \cdot (\lambda s)^n/(2\lambda s)^n \\
 &= n!/(k!\,(n-k)!) \cdot (1/2)^n.
\end{aligned}
\]

This holds for $k = 0, \dots, n$, so we find the $Bin(n, \frac{1}{2})$ distribution.

13.2a From the formulas for the $U(a,b)$ distribution, substituting $a = -1/2$ and
$b = 1/2$, we derive that $E[X_i] = 0$ and $\mathrm{Var}(X_i) = 1/12$.


13.2b We write $S = X_1 + X_2 + \cdots + X_{100}$, for which we find
$E[S] = E[X_1] + \cdots + E[X_{100}] = 0$ and, by independence,
$\mathrm{Var}(S) = \mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_{100}) = 100 \cdot \frac{1}{12} = 100/12$.
We find from Chebyshev's inequality:

\[ P(|S| > 10) = P(|S - 0| > 10) \le \frac{\mathrm{Var}(S)}{10^2} = \frac{1}{12}. \]




13.4a Because $X_i$ has a $Ber(p)$ distribution, $E[X_i] = p$ and $\mathrm{Var}(X_i) = p(1 - p)$,
and so $E[\bar X_n] = p$ and $\mathrm{Var}(\bar X_n) = \mathrm{Var}(X_i)/n = p(1 - p)/n$. By Chebyshev's
inequality:

\[ P(|\bar X_n - p| \ge 0.2) \le \frac{p(1-p)/n}{(0.2)^2} = \frac{25\,p(1-p)}{n}. \]

The right-hand side should be at most 0.1 (note that we switched to the complement).
If $p = 1/2$ we therefore require $25/(4n) \le 0.1$, or $n \ge 25/(4 \cdot 0.1) = 62.5$,
i.e., $n \ge 63$. Now, suppose $p \ne 1/2$; using $n = 63$ and $p(1 - p) \le 1/4$ we conclude
that $25p(1 - p)/n \le 25 \cdot (1/4)/63 = 0.0992 < 0.1$, so (because of the inequality) the
computed value suffices for other values of $p$ as well.


13.4b For arbitrary $a > 0$ we conclude from Chebyshev's inequality:

\[ P(|\bar X_n - p| \ge a) \le \frac{p(1-p)/n}{a^2} = \frac{p(1-p)}{na^2} \le \frac{1}{4na^2}, \]

where we used $p(1 - p) \le 1/4$ again. The question now becomes: when $a = 0.1$, for
what $n$ is $1/(4na^2) \le 0.1$? We find: $n \ge 1/(4 \cdot 0.1 \cdot (0.1)^2) = 250$, so $n = 250$ is large
enough.


13.4c From part a we know that an error of size 0.2 or more occurs with a probability
of at most $25/(4n)$, regardless of the value of $p$. So, we need $25/(4n) \le 0.05$, i.e.,
$n \ge 25/(4 \cdot 0.05) = 125$.


13.4d We compute $P(\bar X_n \le 0.5)$ for the case that $p = 0.6$. Then $E[\bar X_n] = 0.6$
and $\mathrm{Var}(\bar X_n) = 0.6 \cdot 0.4/n$. Chebyshev's inequality cannot be used directly; we need
an intermediate step: the event $\bar X_n \le 0.5$ is contained in the event "the
prediction is off by at least 0.1, in either direction." So

\[ P(\bar X_n \le 0.5) \le P(|\bar X_n - 0.6| \ge 0.1) \le \frac{0.6 \cdot 0.4/n}{(0.1)^2} = \frac{24}{n}. \]

For $n \ge 240$ this probability is 0.1 or smaller.
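
The four sample sizes in 13.4 all come from the same inequality, variance bound divided by $n a^2$ at most a tolerance. The sketch below recomputes them with exact rational arithmetic to avoid rounding issues at the boundary; the function name and its parameters are our own choices, not notation from the book.

```python
from fractions import Fraction
from math import ceil


def chebyshev_sample_size(a, delta, var_bound=Fraction(1, 4)):
    """Smallest integer n with var_bound / (n * a^2) <= delta.

    var_bound is an upper bound on the variance of a single observation;
    the default 1/4 is the worst case p(1 - p) <= 1/4 for Bernoulli data.
    """
    a, delta = Fraction(a), Fraction(delta)
    return ceil(var_bound / (a * a * delta))


n_a = chebyshev_sample_size("0.2", "0.1")    # part a
n_b = chebyshev_sample_size("0.1", "0.1")    # part b
n_c = chebyshev_sample_size("0.2", "0.05")   # part c
n_d = chebyshev_sample_size("0.1", "0.1", var_bound=Fraction("0.24"))  # part d, p = 0.6
print(n_a, n_b, n_c, n_d)
```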


13.9a The statement looks like the law of large numbers, and indeed, if we look
more closely, we see that $T_n$ is the average of an i.i.d. sequence: define $Y_i = X_i^2$,
then $T_n = \bar Y_n$. The law of large numbers now states: if $\bar Y_n$ is the average of $n$
independent random variables with expectation $\mu$ and variance $\sigma^2$, then for any
$\varepsilon > 0$: $\lim_{n \to \infty} P(|\bar Y_n - \mu| > \varepsilon) = 0$. So, if $a = \mu$ and the variance $\sigma^2$ is finite, then
it is true.


13.9b We compute expectation and variance of $Y_i$: $E[Y_i] = E[X_i^2] = \int_{-1}^{1} \frac{1}{2} x^2\,dx = 1/3$.
And: $E[Y_i^2] = E[X_i^4] = \int_{-1}^{1} \frac{1}{2} x^4\,dx = 1/5$, so $\mathrm{Var}(Y_i) = 1/5 - (1/3)^2 = 4/45$.
The variance is finite, so indeed, the law of large numbers applies, and the statement
is true if $a = E[X_i^2] = 1/3$.

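14.3 First note that $P(|\bar X_n - p| < 0.2) = 1 - P(\bar X_n - p \ge 0.2) - P(\bar X_n - p \le -0.2)$.
Because $\mu = p$ and $\sigma^2 = p(1 - p)$, we find, using the central limit theorem:

\[
P(\bar X_n - p \ge 0.2)
 = P\left(\sqrt{n}\,\frac{\bar X_n - p}{\sqrt{p(1-p)}} \ge \sqrt{n}\,\frac{0.2}{\sqrt{p(1-p)}}\right)
 = P\left(Z_n \ge \sqrt{n}\,\frac{0.2}{\sqrt{p(1-p)}}\right)
 \approx P\left(Z \ge \sqrt{n}\,\frac{0.2}{\sqrt{p(1-p)}}\right),
\]

where $Z$ has an $N(0,1)$ distribution. Similarly,

\[ P(\bar X_n - p \le -0.2) \approx P\left(Z \ge \sqrt{n}\,\frac{0.2}{\sqrt{p(1-p)}}\right), \]

so we are looking for the smallest positive integer $n$ such that

\[ 1 - 2P\left(Z \ge \sqrt{n}\,\frac{0.2}{\sqrt{p(1-p)}}\right) \ge 0.9, \]

i.e., the smallest positive integer $n$ such that

\[ P\left(Z \ge \sqrt{n}\,\frac{0.2}{\sqrt{p(1-p)}}\right) \le 0.05. \]

From Table B.1 it follows that

\[ \sqrt{n}\,\frac{0.2}{\sqrt{p(1-p)}} \ge 1.645. \]

Since $p(1 - p) \le 1/4$ for all $p$ between 0 and 1, we see that $n$ should be at least 17.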
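
The final step of 14.3 is a one-line computation. The following sketch takes the critical value 1.645 from the text and uses the worst case $\sqrt{p(1-p)} \le 1/2$; the variable names are ours.

```python
from math import ceil

z = 1.645        # critical value from the text: P(Z >= 1.645) = 0.05
a = 0.2          # allowed deviation of the sample mean from p
worst_sd = 0.5   # sqrt(p(1 - p)) is at most 1/2

# Solve sqrt(n) * a / worst_sd >= z for the smallest integer n.
n = ceil((z * worst_sd / a) ** 2)
print(n)
```

Compared with the Chebyshev-based answer of 63 for the same accuracy in Exercise 13.4a, the central limit theorem approximation asks for far fewer observations.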

14.5 In Section 4.3 we have seen that $X$ has the same probability distribution
as $X_1 + X_2 + \cdots + X_n$, where $X_1, X_2, \dots, X_n$ are independent $Ber(p)$ distributed
random variables. Recall that $E[X_i] = p$, and $\mathrm{Var}(X_i) = p(1 - p)$. But then we have
for any real number $a$ that

\[
P\left(\frac{X - np}{\sqrt{np(1-p)}} \le a\right)
 = P\left(\frac{X_1 + X_2 + \cdots + X_n - np}{\sqrt{np(1-p)}} \le a\right)
 = P(Z_n \le a);
\]

see also (14.1). It follows from the central limit theorem that

\[ P\left(\frac{X - np}{\sqrt{np(1-p)}} \le a\right) \approx \Phi(a), \]

i.e., the random variable $\frac{X - np}{\sqrt{np(1-p)}}$ has a distribution that is approximately standard
normal.

14.9a The probability that for a chain of at least 50 meters more than 1002 links
are needed is the same as the probability that a chain of 1002 links is shorter than
50 meters. Assuming that the random variables $X_1, X_2, \dots, X_{1002}$ are independent,
and using the central limit theorem, we have that

\[
P(X_1 + X_2 + \cdots + X_{1002} < 5000)
 \approx P\left(Z < \sqrt{1002} \cdot \frac{\frac{5000}{1002} - 5}{0.2}\right) = 0.0571,
\]

where $Z$ has an $N(0,1)$ distribution. So about 6% of the customers will receive a
free chain.


14.9b We now have that

\[ P(X_1 + X_2 + \cdots + X_{1002} < 5000) \approx P(Z < 0.0032), \]


which is slightly larger than 1/2. So about half of the customers will receive a free 
chain. Clearly something has to be done: a seemingly minor change of expected value 
has major consequences! 




15.6 Because $(2 - 0) \cdot 0.245 + (4 - 2) \cdot 0.130 + (7 - 4) \cdot 0.050 + (11 - 7) \cdot 0.020 +
(15 - 11) \cdot 0.005 = 1$, there are no data points outside the listed bins. Hence

\[
\begin{aligned}
F_n(7) &= \frac{\text{number of } x_i \le 7}{n}
        = \frac{\text{number of } x_i \text{ in bins } (0,2],\ (2,4] \text{ and } (4,7]}{n} \\
       &= \frac{n \cdot (2-0) \cdot 0.245 + n \cdot (4-2) \cdot 0.130 + n \cdot (7-4) \cdot 0.050}{n} \\
       &= 0.490 + 0.260 + 0.150 = 0.9.
\end{aligned}
\]
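
The computation in 15.6 amounts to adding up histogram areas to the left of a bin edge. Below is a small sketch with the bins and heights quoted in the solution; the function name `ecdf_at_bin_edge` is ours, and it only gives the empirical distribution function at bin boundaries.

```python
# Bins (a, b] and histogram heights as quoted in the solution of 15.6.
bins = [(0, 2), (2, 4), (4, 7), (7, 11), (11, 15)]
heights = [0.245, 0.130, 0.050, 0.020, 0.005]


def ecdf_at_bin_edge(t):
    """F_n(t) for t equal to a bin boundary: total area of the bins left of t."""
    return sum((b - a) * h for (a, b), h in zip(bins, heights) if b <= t)


total_area = ecdf_at_bin_edge(15)  # should be 1: no data outside the bins
f_7 = ecdf_at_bin_edge(7)
print(round(total_area, 3), round(f_7, 3))
```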


15.11 The height of the histogram on a bin $(a, b]$ is

\[
\frac{\text{number of } x_i \text{ in } (a,b]}{n(b-a)}
 = \frac{(\text{number of } x_i \le b) - (\text{number of } x_i \le a)}{n(b-a)}
 = \frac{F_n(b) - F_n(a)}{b - a}.
\]


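15.12a By inserting the expression for $f_{n,h}(t)$, we get

\[ \int_{-\infty}^{\infty} t\, f_{n,h}(t)\,dt = \frac{1}{n} \sum_{i=1}^{n} \int_{-\infty}^{\infty} \frac{t}{h}\, K\!\left(\frac{t - x_i}{h}\right) dt. \]

For each $i$ fixed we find with change of integration variables $u = (t - x_i)/h$,

\[
\int_{-\infty}^{\infty} \frac{t}{h}\, K\!\left(\frac{t - x_i}{h}\right) dt
 = \int_{-\infty}^{\infty} (x_i + hu) K(u)\,du
 = x_i \int_{-\infty}^{\infty} K(u)\,du + h \int_{-\infty}^{\infty} u K(u)\,du = x_i,
\]

using that $K$ integrates to one and that $\int_{-\infty}^{\infty} u K(u)\,du = 0$, because $K$ is symmetric.
Hence

\[ \int_{-\infty}^{\infty} t\, f_{n,h}(t)\,dt = \frac{1}{n} \sum_{i=1}^{n} \int_{-\infty}^{\infty} \frac{t}{h}\, K\!\left(\frac{t - x_i}{h}\right) dt = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar x_n. \]


15.12b By means of similar reasoning

\[ \int_{-\infty}^{\infty} t^2 f_{n,h}(t)\,dt = \frac{1}{n} \sum_{i=1}^{n} \int_{-\infty}^{\infty} \frac{t^2}{h}\, K\!\left(\frac{t - x_i}{h}\right) dt. \]

For each $i$:

\[
\begin{aligned}
\int_{-\infty}^{\infty} \frac{t^2}{h}\, K\!\left(\frac{t - x_i}{h}\right) dt
 &= \int_{-\infty}^{\infty} (x_i + hu)^2 K(u)\,du = \int_{-\infty}^{\infty} (x_i^2 + 2x_i h u + h^2 u^2) K(u)\,du \\
 &= x_i^2 \int_{-\infty}^{\infty} K(u)\,du + 2x_i h \int_{-\infty}^{\infty} u K(u)\,du + h^2 \int_{-\infty}^{\infty} u^2 K(u)\,du \\
 &= x_i^2 + h^2 \int_{-\infty}^{\infty} u^2 K(u)\,du,
\end{aligned}
\]

again using that $K$ integrates to one and that $K$ is symmetric.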


16.3a Because $n = 24$, the sample median is the average of the 12th and 13th
elements. Since these are both equal to 70, the sample median is also 70. The lower
quartile is the $p$th empirical quantile for $p = 1/4$. We get $k = \lfloor p(n + 1) \rfloor = 6$, so
that

\[ q_n(0.25) = x_{(6)} + 0.25 \cdot (x_{(7)} - x_{(6)}) = 66 + 0.25 \cdot (67 - 66) = 66.25. \]

Similarly, the upper quartile is the $p$th empirical quantile for $p = 3/4$:

\[ q_n(0.75) = x_{(18)} + 0.75 \cdot (x_{(19)} - x_{(18)}) = 75 + 0.75 \cdot (75 - 75) = 75. \]
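
The empirical quantile rule used in 16.3a, with $k = \lfloor p(n+1) \rfloor$ and interpolation weight $\alpha = p(n+1) - k$, can be sketched as follows. The dataset in the example is hypothetical (the integers 1 to 24), chosen only because it has the same size $n = 24$ as in the exercise; the function name is ours.

```python
from math import floor


def empirical_quantile(data, p):
    """p-th empirical quantile: x_(k) + alpha * (x_(k+1) - x_(k)),
    where k = floor(p (n + 1)) and alpha = p (n + 1) - k."""
    xs = sorted(data)
    n = len(xs)
    k = floor(p * (n + 1))
    alpha = p * (n + 1) - k
    if k < 1 or k >= n:
        raise ValueError("p is too close to 0 or 1 for this sample size")
    # Order statistics are 1-indexed, Python lists are 0-indexed.
    return xs[k - 1] + alpha * (xs[k] - xs[k - 1])


hypothetical_data = list(range(1, 25))  # n = 24, NOT the data of the exercise
lower = empirical_quantile(hypothetical_data, 0.25)
upper = empirical_quantile(hypothetical_data, 0.75)
print(lower, upper, upper - lower)
```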


16.3b In part a we found the sample median and the two quartiles. From this we
compute the IQR: $q_n(0.75) - q_n(0.25) = 75 - 66.25 = 8.75$. This means that

\[ q_n(0.25) - 1.5 \cdot \mathrm{IQR} = 66.25 - 1.5 \cdot 8.75 = 53.125, \]
\[ q_n(0.75) + 1.5 \cdot \mathrm{IQR} = 75 + 1.5 \cdot 8.75 = 88.125. \]

Hence, the last element below 88.125 is 88, and the first element above 53.125 is 57.
Therefore, the upper whisker runs until 88 and the lower whisker until 57, with two
elements 53 and 31 below. This leads to the following boxplot:

[Figure: boxplot with box from 66.25 to 75, median at 70, whiskers extending to 57
and 88, and two points at 53 and 31 below the lower whisker.]


16.3c The values 53 and 31 are outliers. Value 31 is far away from the bulk of the 
data and appears to be an extreme outlier. 

16.6a Yes, we find $\bar x = (1 + 5 + 9)/3 = 15/3 = 5$, $\bar y = (2 + 4 + 6 + 8)/4 = 20/4 = 5$,
so that $(\bar x + \bar y)/2 = 5$. The average for the combined dataset is also equal to 5:
$(15 + 20)/7 = 5$.


16.6b The mean of $x_1, x_2, \dots, x_n, y_1, y_2, \dots, y_m$ equals

\[
\frac{x_1 + \cdots + x_n + y_1 + \cdots + y_m}{n + m}
 = \frac{n \bar x_n + m \bar y_m}{n + m}
 = \frac{n}{n+m}\, \bar x_n + \frac{m}{n+m}\, \bar y_m.
\]

In general, this is not equal to $(\bar x_n + \bar y_m)/2$. For instance, replace 1 in the first dataset
by 4. Then $\bar x_n = 6$ and $\bar y_m = 5$, so that $(\bar x_n + \bar y_m)/2 = 5\frac{1}{2}$. However, the average of
the combined dataset is $38/7 = 5\frac{3}{7}$.

16.6c Yes, $m = n$ implies $n/(n + m) = m/(n + m) = 1/2$. From the expressions
found in part b we see that the sample mean of the combined dataset equals
$(\bar x_n + \bar y_m)/2$.

16.8 The ordered combined dataset is 1, 2, 4, 5, 6, 8, 9, so that the sample median 
equals 5. The absolute deviations from 5 are: 4, 3, 1, 0, 1, 3, 4, and if we put them in 
order: 0, 1, 1, 3, 3, 4, 4. The MAD is the sample median of the absolute deviations, 
which is 3. 
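
The MAD computation in 16.8 is short enough to replay directly. The sketch below uses the two datasets from Exercise 16.6 as they appear in the solutions and relies only on `statistics.median` from the standard library.

```python
from statistics import median

# Combined dataset from the solutions of 16.6 and 16.8.
combined = [1, 5, 9] + [2, 4, 6, 8]

med = median(combined)
absolute_deviations = sorted(abs(x - med) for x in combined)
mad = median(absolute_deviations)
print(med, absolute_deviations, mad)
```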


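16.15 First write

\[
\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar x_n)^2
 = \frac{1}{n} \sum_{i=1}^{n} \left(x_i^2 - 2 \bar x_n x_i + \bar x_n^2\right)
 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - 2 \bar x_n \cdot \frac{1}{n} \sum_{i=1}^{n} x_i + \frac{1}{n} \sum_{i=1}^{n} \bar x_n^2.
\]

Next, by inserting

\[ \frac{1}{n} \sum_{i=1}^{n} x_i = \bar x_n \quad \text{and} \quad \frac{1}{n} \sum_{i=1}^{n} \bar x_n^2 = \frac{1}{n} \cdot n \cdot \bar x_n^2 = \bar x_n^2, \]

we find

\[
\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar x_n)^2
 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - 2 \bar x_n^2 + \bar x_n^2
 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar x_n^2.
\]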


17.3a The model distribution corresponds to the number of women in a queue. A 
queue has 10 positions. The occurrence of a woman in any position is independent 
of the occurrence of a woman in other positions. At each position a woman occurs 
with probability p. Counting the occurrence of a woman as a “success,” the number 
of women in a queue corresponds to the number of successes in 10 independent 
experiments with probability p of success and is therefore modeled by a Bin(10, p) 
distribution. 


17.3b We have 100 queues and the number of women $x_i$ in the $i$th queue is a
realization of a $Bin(10, p)$ random variable. Hence, according to Table 17.2, the
average number of women $\bar x_{100}$ resembles the expectation $10p$ of the $Bin(10, p)$
distribution. We find $\bar x_{100} = 435/100 = 4.35$, so an estimate for $p$ is $4.35/10 = 0.435$.


17.7 a If we model the series of disasters by a Poisson process, then as a property of 
the Poisson process, the interdisaster times should follow an exponential distribution 
(see Section 12.3). This is indeed confirmed by the histogram and empirical distri- 
bution of the observed interdisaster times; they resemble the probability density and 
distribution function of an exponential distribution. 


17.7b The average length of a time interval is 40 549/190 = 213.4 days. Following
Table 17.2 this should resemble the expectation of the $Exp(\lambda)$ distribution, which
is $1/\lambda$. Hence, as an estimate for $\lambda$ we could take 190/40 549 = 0.00469.


17.9a A (perfect) cylindrical cone with diameter $d$ (at the base) and height $h$ has
volume $\pi d^2 h/12$, or about $0.26\,d^2 h$. The effective wood of a tree is the trunk without
the branches. Since the trunk is similar to a cylindrical cone, one can expect a linear
relation between the effective wood and $d^2 h$.




17.9b We find 
z, — Lyle _ 9369 _ 4 aq99 
ra al 
gz Qewdln _ 6486/31 _ 9 s49g 


(Q.2i)/n  87.456/31 
So iyi _ 95.498 


eG 


least squares = 


18.3a Note that generating from the empirical distribution function is the same as
choosing one of the elements of the original dataset with equal probability. Hence,
an element in the bootstrap dataset equals 0.35 with probability 0.1. The number
of ways to have exactly three out of ten elements equal to 0.35 is $\binom{10}{3}$ and each has
probability $(0.1)^3 (0.9)^7$. Therefore, the probability that the bootstrap dataset has
exactly three elements equal to 0.35 is equal to $\binom{10}{3} (0.1)^3 (0.9)^7 = 0.0574$.


18.3b Having at most two elements less than or equal to 0.38 means that 0, 1,
or 2 elements are less than or equal to 0.38. Five elements of the original dataset
are smaller than or equal to 0.38, so that an element in the bootstrap dataset is
less than or equal to 0.38 with probability 0.5. Hence, the probability that the
bootstrap dataset has at most two elements less than or equal to 0.38 is equal to
$(0.5)^{10} + \binom{10}{1} (0.5)^{10} + \binom{10}{2} (0.5)^{10} = 0.0547$.


18.3c Five elements of the dataset are smaller than or equal to 0.38 and two
are greater than 0.42. Therefore, obtaining a bootstrap dataset with two elements
less than or equal to 0.38, and the other elements greater than 0.42 has probability
$(0.5)^2 (0.2)^8$. The number of such bootstrap datasets is $\binom{10}{2}$. So the answer is
$\binom{10}{2} (0.5)^2 (0.2)^8 = 0.000029$.
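
The three probabilities in 18.3 are plain binomial-type counts, so they are easy to verify. The sketch below uses `math.comb`; the variable names are ours.

```python
from math import comb

n = 10  # size of the bootstrap dataset

# 18.3a: exactly three elements equal to 0.35 (probability 0.1 each).
p_a = comb(n, 3) * 0.1**3 * 0.9**7

# 18.3b: at most two elements <= 0.38 (probability 0.5 each).
p_b = sum(comb(n, k) for k in range(3)) * 0.5**n

# 18.3c: two elements <= 0.38 (prob. 0.5), the other eight > 0.42 (prob. 0.2).
p_c = comb(n, 2) * 0.5**2 * 0.2**8

print(round(p_a, 4), round(p_b, 4), round(p_c, 6))
```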


18.7 For the parametric bootstrap, we must estimate the parameter $\theta$ by
$\hat\theta = (n + 1) m_n / n$, and generate bootstrap samples from the $U(0, \hat\theta)$ distribution. This
distribution has expectation $\mu_{\hat\theta} = \hat\theta/2 = (n + 1) m_n/(2n)$. Hence, for each bootstrap
sample $x_1^*, x_2^*, \dots, x_n^*$ compute $\bar x_n^* - \mu_{\hat\theta} = \bar x_n^* - (n + 1) m_n/(2n)$.

Note that this is different from the empirical bootstrap simulation, where one would
estimate $\mu$ by $\bar x_n$ and compute $\bar x_n^* - \bar x_n$.


18.8a Since we know nothing about the distribution of the interfailure times, we
estimate $F$ by the empirical distribution function $F_n$ of the software data and we
estimate the expectation $\mu$ of $F$ by the expectation $\mu^* = \bar x_n = 656.8815$ of $F_n$.
The bootstrapped centered sample mean is the random variable $\bar X_n^* - 656.8815$. The
corresponding empirical bootstrap simulation is described as follows:


1. Generate a bootstrap dataset $x_1^*, x_2^*, \dots, x_n^*$ from $F_n$, i.e., draw with replacement
135 numbers from the software data.
2. Compute the centered sample mean for the bootstrap dataset:

\[ \bar x_n^* - 656.8815, \]

where $\bar x_n^*$ is the sample mean of $x_1^*, x_2^*, \dots, x_n^*$.


Repeat steps 1 and 2 one thousand times. 
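
The two steps of the empirical bootstrap simulation can be sketched as below. The data in the example are made up for illustration and are not the software data of the exercise; the function name, the seed, and the use of `random.Random` are our own choices.

```python
import random
from statistics import mean


def empirical_bootstrap_centered_means(data, repetitions=1000, seed=0):
    """Return `repetitions` realizations of the bootstrapped centered sample mean."""
    rng = random.Random(seed)
    n = len(data)
    xbar = mean(data)  # expectation of the empirical distribution function
    results = []
    for _ in range(repetitions):
        # Step 1: draw n values with replacement from the original dataset.
        resample = rng.choices(data, k=n)
        # Step 2: centered sample mean of the bootstrap dataset.
        results.append(mean(resample) - xbar)
    return results


# Hypothetical interfailure times, NOT the dataset from the book.
made_up_data = [12.0, 47.5, 3.2, 88.1, 150.4, 9.9, 61.0, 230.7]
centered_means = empirical_bootstrap_centered_means(made_up_data)
print(len(centered_means), min(centered_means), max(centered_means))
```

For the parametric variants in 18.8b and 18.8c, only step 1 would change: the resample would be drawn from the fitted exponential distribution (for instance with `rng.expovariate`) instead of from the data.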


18.8b Because the interfailure times are now assumed to have an $Exp(\lambda)$ distribution,
we must estimate $\lambda$ by $\hat\lambda = 1/\bar x_n = 0.0015$ and estimate $F$ by the distribution
function of the $Exp(0.0015)$ distribution. Estimate the expectation $\mu = 1/\lambda$ of the
$Exp(\lambda)$ distribution by $\mu^* = 1/\hat\lambda = \bar x_n = 656.8815$. Also now, the bootstrapped
centered sample mean is the random variable $\bar X_n^* - 656.8815$. The corresponding
parametric bootstrap simulation is described as follows:


1. Generate a bootstrap dataset $x_1^*, x_2^*, \dots, x_n^*$ from the $Exp(0.0015)$ distribution.
2. Compute the centered sample mean for the bootstrap dataset:

\[ \bar x_n^* - 656.8815, \]

where $\bar x_n^*$ is the sample mean of $x_1^*, x_2^*, \dots, x_n^*$.


Repeat steps 1 and 2 one thousand times. We see that in this simulation the
bootstrapped centered sample mean is the same in both cases: $\bar X_n^* - \bar x_n$, but the
corresponding simulation procedures differ in step 1.


18.8c Estimate $\lambda$ by $\hat\lambda = \ln 2 / m_n = 0.0024$ and estimate $F$ by the distribution
function of the $Exp(0.0024)$ distribution. Estimate the expectation $\mu = 1/\lambda$ of the
$Exp(\lambda)$ distribution by $\mu^* = 1/\hat\lambda = 418.3816$. The corresponding parametric
bootstrap simulation is described as follows:


1. Generate a bootstrap dataset $x_1^*, x_2^*, \dots, x_n^*$ from the $Exp(0.0024)$ distribution.
2. Compute the centered sample mean for the bootstrap dataset:

\[ \bar x_n^* - 418.3816, \]

where $\bar x_n^*$ is the sample mean of $x_1^*, x_2^*, \dots, x_n^*$.


Repeat steps 1 and 2 one thousand times. We see that in this parametric bootstrap
simulation the bootstrapped centered sample mean is different from the one in the
empirical bootstrap simulation: $\bar X_n^* - m_n/\ln 2$ instead of $\bar X_n^* - \bar x_n$.


19.1a From the formulas for the expectation and variance of uniform random
variables we know that $E[X_i] = 0$ and $\mathrm{Var}(X_i) = (2\theta)^2/12 = \theta^2/3$. Hence
$E[X_i^2] = \mathrm{Var}(X_i) + (E[X_i])^2 = \theta^2/3$. Therefore, by linearity of expectations

\[ E[T] = \frac{3}{n} \left(\frac{\theta^2}{3} + \cdots + \frac{\theta^2}{3}\right) = \frac{3}{n} \cdot n \cdot \frac{\theta^2}{3} = \theta^2. \]

Since $E[T] = \theta^2$, the random variable $T$ is an unbiased estimator for $\theta^2$.

19.1b The function $g(x) = -\sqrt{x}$ is a strictly convex function, because
$g''(x) = (x^{-3/2})/4 > 0$. Therefore, by Jensen's inequality, $-\sqrt{E[T]} < -E[\sqrt{T}]$. Since, from
part a we know that $E[T] = \theta^2$, this means that $E[\sqrt{T}] < \theta$. In other words, $\sqrt{T}$
is a biased estimator for $\theta$, with negative bias.


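19.8 From the model assumptions it follows that $E[Y_i] = \beta x_i$ for each $i$. Using
linearity of expectations, this implies that

\[
\begin{aligned}
E[B_1] &= \frac{1}{n} \left(\frac{E[Y_1]}{x_1} + \cdots + \frac{E[Y_n]}{x_n}\right) = \frac{1}{n} \left(\frac{\beta x_1}{x_1} + \cdots + \frac{\beta x_n}{x_n}\right) = \beta, \\
E[B_2] &= \frac{E[Y_1] + \cdots + E[Y_n]}{x_1 + \cdots + x_n} = \frac{\beta x_1 + \cdots + \beta x_n}{x_1 + \cdots + x_n} = \beta, \\
E[B_3] &= \frac{x_1 E[Y_1] + \cdots + x_n E[Y_n]}{x_1^2 + \cdots + x_n^2} = \frac{\beta x_1^2 + \cdots + \beta x_n^2}{x_1^2 + \cdots + x_n^2} = \beta.
\end{aligned}
\]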


20.2a Compute the mean squared errors of $S$ and $T$: $\mathrm{MSE}(S) = \mathrm{Var}(S) +
[\mathrm{bias}(S)]^2 = 40 + 0 = 40$; $\mathrm{MSE}(T) = \mathrm{Var}(T) + [\mathrm{bias}(T)]^2 = 4 + 9 = 13$. We prefer $T$,
because it has a smaller MSE.

20.2b Compute the mean squared errors of $S$ and $T$: $\mathrm{MSE}(S) = 40$, as in a;
$\mathrm{MSE}(T) = \mathrm{Var}(T) + [\mathrm{bias}(T)]^2 = 4 + a^2$. So, if $a < 6$: prefer $T$. If $a > 6$: prefer $S$.
The preferences are based on the MSE criterion.

20.3 $\mathrm{Var}(T_1) = 1/(n\lambda^2)$, $\mathrm{Var}(T_2) = 1/\lambda^2$; hence we prefer $T_1$, because of its smaller
variance.


20.8a This follows directly from linearity of expectations:

\[ E[T] = E[r \bar X_n + (1 - r) \bar Y_m] = r E[\bar X_n] + (1 - r) E[\bar Y_m] = r\mu + (1 - r)\mu = \mu. \]


20.8b Using that $\bar X_n$ and $\bar Y_m$ are independent, we find $\mathrm{MSE}(T) = \mathrm{Var}(T) =
r^2\,\mathrm{Var}(\bar X_n) + (1 - r)^2\,\mathrm{Var}(\bar Y_m) = r^2 \cdot \sigma^2/n + (1 - r)^2 \cdot \sigma^2/m$.

To find the minimum of this parabola we differentiate with respect to $r$ and
equate the result to 0: $2r/n - 2(1 - r)/m = 0$. This gives the minimizing value of $r$:
$2rm - 2n(1 - r) = 0$, or $r = n/(n + m)$.


21.1 Setting $X_i = j$ if red appears in the $i$th experiment for the first time on the
$j$th throw, we have that $X_1$, $X_2$, and $X_3$ are independent $Geo(p)$ distributed random
variables, where $p$ is the probability that red appears when throwing the selected
die. The likelihood function is

\[
L(p) = P(X_1 = 3, X_2 = 5, X_3 = 4) = (1 - p)^2 p \cdot (1 - p)^4 p \cdot (1 - p)^3 p = p^3 (1 - p)^9,
\]

so for $D_1$ one has that $L(p) = L(\frac{5}{6}) = (\frac{5}{6})^3 (1 - \frac{5}{6})^9$, whereas for $D_2$ one has that
$L(p) = L(\frac{1}{6}) = (\frac{1}{6})^3 (1 - \frac{1}{6})^9 = 5^6 \cdot L(\frac{5}{6})$. It is very likely that we picked $D_2$.

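
The comparison of the two likelihoods in 21.1 can be reproduced exactly with rational arithmetic. In this sketch the function `geometric_likelihood` is our own helper, and the red-face probabilities 5/6 and 1/6 for the two dice are taken from the solution above.

```python
from fractions import Fraction
from math import prod


def geometric_likelihood(p, observations):
    """Likelihood of independent Geo(p) observations: product of (1 - p)^(k - 1) p."""
    return prod((1 - p) ** (k - 1) * p for k in observations)


observations = [3, 5, 4]  # first red on the 3rd, 5th and 4th throw
likelihood_d1 = geometric_likelihood(Fraction(5, 6), observations)
likelihood_d2 = geometric_likelihood(Fraction(1, 6), observations)
ratio = likelihood_d2 / likelihood_d1
print(ratio)
```

The ratio is a power of 5, which is why the observed data point so strongly toward the second die.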
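21.4a The likelihood $L(\mu)$ is given by

\[
\begin{aligned}
L(\mu) &= P(X_1 = x_1, \dots, X_n = x_n) = P(X_1 = x_1) \cdots P(X_n = x_n) \\
       &= \frac{\mu^{x_1}}{x_1!} e^{-\mu} \cdots \frac{\mu^{x_n}}{x_n!} e^{-\mu}
        = \frac{e^{-n\mu}}{x_1! \cdots x_n!}\, \mu^{x_1 + x_2 + \cdots + x_n}.
\end{aligned}
\]


21.4b We find that the loglikelihood $\ell(\mu)$ is given by

\[ \ell(\mu) = \left(\sum_{i=1}^{n} x_i\right) \ln(\mu) - \ln(x_1! \cdots x_n!) - n\mu. \]

Hence

\[ \frac{d\ell}{d\mu} = \frac{\sum_{i=1}^{n} x_i}{\mu} - n, \]

and we find (after checking that we indeed have a maximum!) that $\bar x_n$ is the
maximum likelihood estimate for $\mu$.


21.4c In b we have seen that $\bar x_n$ is the maximum likelihood estimate for $\mu$. Due to
the invariance principle from Section 21.4 we thus find that $e^{-\bar x_n}$ is the maximum
likelihood estimate for $e^{-\mu}$.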




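21.8a The likelihood $L(\theta)$ is given by

\[
\begin{aligned}
L(\theta) &= C \cdot \left(\tfrac{1}{4}(2 + \theta)\right)^{1997} \cdot \left(\tfrac{1}{4}\theta\right)^{32} \cdot \left(\tfrac{1}{4}(1 - \theta)\right)^{906} \cdot \left(\tfrac{1}{4}(1 - \theta)\right)^{904} \\
          &= \frac{C}{4^{3839}} \cdot (2 + \theta)^{1997} \cdot \theta^{32} \cdot (1 - \theta)^{1810},
\end{aligned}
\]

where $C$ is the number of ways we can assign 1997 starchy-greens, 32 sugary-whites,
906 starchy-whites, and 904 sugary-greens to 3839 plants. Hence the loglikelihood
$\ell(\theta)$ is given by

\[ \ell(\theta) = \ln(C) - 3839 \ln(4) + 1997 \ln(2 + \theta) + 32 \ln(\theta) + 1810 \ln(1 - \theta). \]

21.8b A short calculation shows that

\[ \frac{d\ell(\theta)}{d\theta} = 0 \iff 3839\,\theta^2 + 1655\,\theta - 64 = 0, \]

so the maximum likelihood estimate of $\theta$ is (after checking that $L(\theta)$ indeed attains
a maximum for this value of $\theta$):

\[ \frac{-1655 + \sqrt{3721809}}{7678} = 0.0357. \]

21.8c In this general case the likelihood $L(\theta)$ is given by

\[
\begin{aligned}
L(\theta) &= C \cdot \left(\tfrac{1}{4}(2 + \theta)\right)^{n_1} \cdot \left(\tfrac{1}{4}\theta\right)^{n_2} \cdot \left(\tfrac{1}{4}(1 - \theta)\right)^{n_3} \cdot \left(\tfrac{1}{4}(1 - \theta)\right)^{n_4} \\
          &= \frac{C}{4^{n}} \cdot (2 + \theta)^{n_1} \cdot \theta^{n_2} \cdot (1 - \theta)^{n_3 + n_4},
\end{aligned}
\]

where $C$ is the number of ways we can assign $n_1$ starchy-greens, $n_2$ sugary-whites,
$n_3$ starchy-whites, and $n_4$ sugary-greens to $n$ plants. Hence the loglikelihood $\ell(\theta)$ is
given by

\[ \ell(\theta) = \ln(C) - n \ln(4) + n_1 \ln(2 + \theta) + n_2 \ln(\theta) + (n_3 + n_4) \ln(1 - \theta). \]

A short calculation shows that

\[ \frac{d\ell(\theta)}{d\theta} = 0 \iff n\theta^2 - (n_1 - n_2 - 2n_3 - 2n_4)\theta - 2n_2 = 0, \]

so the maximum likelihood estimate of $\theta$ is (after checking that $L(\theta)$ indeed attains
a maximum for this value of $\theta$):

\[ \frac{n_1 - n_2 - 2n_3 - 2n_4 + \sqrt{(n_1 - n_2 - 2n_3 - 2n_4)^2 + 8 n n_2}}{2n}. \]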
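
The general formula from 21.8c can be evaluated for the counts of 21.8a, and the result can be checked against the score equation. This is a sketch under the model stated in the solution; `theta_mle` and `score` are names chosen here.

```python
from math import sqrt


def theta_mle(n1, n2, n3, n4):
    """Positive root of n t^2 - (n1 - n2 - 2 n3 - 2 n4) t - 2 n2 = 0."""
    n = n1 + n2 + n3 + n4
    b = n1 - n2 - 2 * n3 - 2 * n4
    return (b + sqrt(b * b + 8 * n * n2)) / (2 * n)


def score(theta, n1, n2, n3, n4):
    """Derivative of the loglikelihood from 21.8c."""
    return n1 / (2 + theta) + n2 / theta - (n3 + n4) / (1 - theta)


counts = (1997, 32, 906, 904)  # counts quoted in 21.8a
theta_hat = theta_mle(*counts)
print(round(theta_hat, 4), score(theta_hat, *counts))
```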


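21.11a Since the dataset is a realization of a random sample from a $Geo(1/N)$
distribution, the likelihood is $L(N) = P(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n)$, where
each $X_i$ has a $Geo(1/N)$ distribution. So

\[
\begin{aligned}
L(N) &= \left(1 - \frac{1}{N}\right)^{x_1 - 1} \frac{1}{N} \cdot \left(1 - \frac{1}{N}\right)^{x_2 - 1} \frac{1}{N} \cdots \left(1 - \frac{1}{N}\right)^{x_n - 1} \frac{1}{N} \\
     &= \left(1 - \frac{1}{N}\right)^{-n + \sum_{i=1}^{n} x_i} \left(\frac{1}{N}\right)^{n}.
\end{aligned}
\]

But then the loglikelihood is equal to

\[ \ell(N) = -n \ln N + \left(-n + \sum_{i=1}^{n} x_i\right) \ln\left(1 - \frac{1}{N}\right). \]

Differentiating with respect to $N$ yields

\[ \frac{d}{dN} \ell(N) = \frac{-n}{N} + \left(-n + \sum_{i=1}^{n} x_i\right) \frac{1}{N(N - 1)}. \]

Now $\frac{d}{dN} \ell(N) = 0$ if and only if $N = \bar x_n$. Because $\ell(N)$ attains its maximum at
$\bar x_n$, we find that the maximum likelihood estimate of $N$ is $\hat N = \bar x_n$.


21.11b Since $P(Y = k) = 1/N$ for $k = 1, 2, \dots, N$, the likelihood is given by

\[ L(N) = \left(\frac{1}{N}\right)^{n} \quad \text{for } N \ge y_{(n)}, \]

and $L(N) = 0$ for $N < y_{(n)}$. So $L(N)$ attains its maximum at $y_{(n)}$; the maximum
likelihood estimate of $N$ is $\hat N = y_{(n)}$.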


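22.1a Since $\sum x_i y_i = 12.4$, $\sum x_i = 9$, $\sum y_i = 4.8$, $\sum x_i^2 = 35$, and $n = 3$, we find
(cf. (22.1) and (22.2)) that

\[
\hat\beta = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{n \sum x_i^2 - (\sum x_i)^2}
          = \frac{3 \cdot 12.4 - 9 \cdot 4.8}{3 \cdot 35 - 9^2} = -0.25,
\]

and $\hat\alpha = \bar y_n - \hat\beta \bar x_n = 2.35$.

22.1b Since $r_i = y_i - \hat\alpha - \hat\beta x_i$, for $i = 1, \dots, n$, we find $r_1 = 2 - 2.35 + 0.25 = -0.1$,
$r_2 = 1.8 - 2.35 + 0.75 = 0.2$, $r_3 = 1 - 2.35 + 1.25 = -0.1$, and $r_1 + r_2 + r_3 =
-0.1 + 0.2 - 0.1 = 0$.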
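
The least squares estimates and residuals of 22.1 can be recomputed in a few lines. The data points (1, 2), (3, 1.8), (5, 1) used below are an assumption: they are inferred from the sums and the residual computations in the solution, not quoted from the exercise itself.

```python
# Data points inferred from the sums and residuals in the solution (assumption).
xs = [1, 3, 5]
ys = [2, 1.8, 1]
n = len(xs)

sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_xx = sum(x * x for x in xs)

# Least squares slope and intercept for simple linear regression.
beta_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x**2)
alpha_hat = sum_y / n - beta_hat * sum_x / n

residuals = [y - alpha_hat - beta_hat * x for x, y in zip(xs, ys)]
print(round(beta_hat, 4), round(alpha_hat, 4), [round(r, 4) for r in residuals])
```

That the residuals add up to zero is not a coincidence of these data: it holds for any least squares fit that includes an intercept.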


22.1c See Figure D.1.


Fig. D.1. Solution of Exercise 22.1c.


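22.5 With the assumption that $\alpha = 0$, the method of least squares tells us now to
minimize

\[ S(\beta) = \sum_{i=1}^{n} (y_i - \beta x_i)^2. \]

Now

\[ \frac{dS(\beta)}{d\beta} = -2 \sum_{i=1}^{n} (y_i - \beta x_i) x_i = -2 \left(\sum_{i=1}^{n} x_i y_i - \beta \sum_{i=1}^{n} x_i^2\right), \]

so

\[ \frac{dS(\beta)}{d\beta} = 0 \iff \beta = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}. \]

Because $S(\beta)$ has a minimum for this last value of $\beta$, we see that the least squares
estimator $\hat\beta$ of $\beta$ is given by

\[ \hat\beta = \frac{\sum_{i=1}^{n} x_i Y_i}{\sum_{i=1}^{n} x_i^2}. \]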


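22.12a Since the denominator of $\hat\beta$ is a number, not a random variable, one has
that

\[ E[\hat\beta] = \frac{E\left[n \sum x_i Y_i - (\sum x_i)(\sum Y_i)\right]}{n \sum x_i^2 - (\sum x_i)^2}. \]

Furthermore, the numerator of this last fraction can be written as

\[ E\left[n \sum x_i Y_i\right] - E\left[(\textstyle\sum x_i)(\sum Y_i)\right], \]

which is equal to

\[ n \sum (x_i E[Y_i]) - (\textstyle\sum x_i) \sum E[Y_i]. \]


22.12b Substituting $E[Y_i] = \alpha + \beta x_i$ in the last expression, we find that

\[ E[\hat\beta] = \frac{n \sum (x_i(\alpha + \beta x_i)) - (\sum x_i) \sum (\alpha + \beta x_i)}{n \sum x_i^2 - (\sum x_i)^2}. \]


22.12c The numerator of the previous expression for $E[\hat\beta]$ can be simplified to yield

\[ \frac{n\alpha \sum x_i + n\beta \sum x_i^2 - n\alpha \sum x_i - \beta (\sum x_i)(\sum x_i)}{n \sum x_i^2 - (\sum x_i)^2}, \]

which is equal to

\[ \frac{\beta \left(n \sum x_i^2 - (\sum x_i)^2\right)}{n \sum x_i^2 - (\sum x_i)^2}. \]

22.12d From c it now follows that $E[\hat\beta] = \beta$, i.e., $\hat\beta$ is an unbiased estimator for $\beta$.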


23.5a The standard confidence interval for the mean of a normal sample with
unknown variance applies, with $n = 23$, $\bar x = 0.82$ and $s = 1.78$, so:

\[ \left(\bar x - t_{22,0.025} \cdot \frac{s}{\sqrt{23}},\ \bar x + t_{22,0.025} \cdot \frac{s}{\sqrt{23}}\right). \]

The critical values come from the $t(22)$ distribution: $t_{22,0.025} = 2.074$. The actual
interval becomes:

\[ \left(0.82 - 2.074 \cdot \frac{1.78}{\sqrt{23}},\ 0.82 + 2.074 \cdot \frac{1.78}{\sqrt{23}}\right) = (0.050, 1.590). \]

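
The interval in 23.5a follows from three summary numbers and one table value. In the sketch below the critical value 2.074 is copied from the text rather than computed, since the standard library has no t-distribution; the variable names are ours.

```python
from math import sqrt

n, xbar, s = 23, 0.82, 1.78
t_crit = 2.074  # t_{22, 0.025}, taken from the text

half_width = t_crit * s / sqrt(n)
interval = (xbar - half_width, xbar + half_width)
print(tuple(round(v, 3) for v in interval))
```

The bootstrap interval of 23.5c has the same shape, with the two table values replaced by the simulated critical values (2.088 on the left, 2.101 on the right).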




23.5b Generate one thousand samples of size 23, by drawing with replacement
from the 23 numbers

1.06, 1.04, 2.62, ..., 2.01.

For each sample $x_1^*, x_2^*, \dots, x_{23}^*$ compute: $t^* = (\bar x_{23}^* - 0.82)/(s_{23}^*/\sqrt{23})$, where
$s_{23}^* = \sqrt{\frac{1}{22} \sum_{i=1}^{23} (x_i^* - \bar x_{23}^*)^2}$.

23.5c We need to estimate the critical value $c_l^*$ such that $P(T^* \le c_l^*) \approx 0.025$. We
take $c_l^* = -2.101$, the 25th of the ordered values, an estimate for the 25/1000 = 0.025
quantile. Similarly, $c_u^*$ is estimated by the 976th, which is 2.088.

The bootstrap confidence interval uses the $c^*$ values instead of the t-distribution
values $\pm t_{n-1,\alpha/2}$, but beware: $c_l^*$ is from the left tail and appears on the right-hand
side of the interval and $c_u^*$ on the left-hand side:

\[ \left(\bar x_n - c_u^* \frac{s_n}{\sqrt{n}},\ \bar x_n - c_l^* \frac{s_n}{\sqrt{n}}\right). \]

Substituting $c_l^* = -2.101$ and $c_u^* = 2.088$, the confidence interval becomes:

\[ \left(0.82 - 2.088 \cdot \frac{1.78}{\sqrt{23}},\ 0.82 + 2.101 \cdot \frac{1.78}{\sqrt{23}}\right) = (0.045, 1.600). \]


23.6a Because events described by inequalities do not change when we multiply
the inequalities by a positive constant or add or subtract a constant, the
following equalities hold: $P(\tilde L_n < \theta < \tilde U_n) = P(3L_n + 7 < 3\mu + 7 < 3U_n + 7) =
P(3L_n < 3\mu < 3U_n) = P(L_n < \mu < U_n)$, and this equals 0.95, as is given.

23.6b The confidence interval for $\theta$ is obtained as the realization of $(\tilde L_n, \tilde U_n)$; that
is: $(\tilde l_n, \tilde u_n) = (3l_n + 7, 3u_n + 7)$. This is obtained by transforming the confidence
interval for $\mu$ (using the transformation that is applied to $\mu$ to get $\theta$).


23.6c We start with $P(L_n < \mu < U_n) = 0.95$ and try to get $1 - \mu$ in the middle:
$P(L_n < \mu < U_n) = P(-L_n > -\mu > -U_n) = P(1 - L_n > 1 - \mu > 1 - U_n) =
P(1 - U_n < 1 - \mu < 1 - L_n)$, where we see that the minus sign causes an interchange:
$\tilde L_n = 1 - U_n$ and $\tilde U_n = 1 - L_n$. The confidence interval: $(1 - 5, 1 - (-2)) =
(-4, 3)$.

23.6d If we knew that $L_n$ and $U_n$ were always positive, then we could conclude:
$P(L_n < \mu < U_n) = P(L_n^2 < \mu^2 < U_n^2)$ and we could just square the numbers in the
confidence interval for $\mu$ to get the one for $\theta$. Without the positivity assumption, the
sharpest conclusion you can draw from $L_n < \mu < U_n$ is that $\mu^2$ is smaller than the
maximum of $L_n^2$ and $U_n^2$. So, $0.95 = P(L_n < \mu < U_n) \le P(0 \le \mu^2 < \max\{L_n^2, U_n^2\})$
and the confidence interval $[0, \max\{l_n^2, u_n^2\}) = [0, 25)$ has a confidence of at least
95%. This kind of problem may occur when the transformation is not one-to-one
(both $-1$ and 1 are mapped to 1 by squaring).


23.11a For the 98% confidence interval the same formula is used as for the 95% 
interval, replacing the critical values by larger ones. This is the case, no matter 
whether the critical values are from the normal or t-distribution, or from a bootstrap 
experiment. Therefore, the 98% interval contains the 95%, and so must also contain 
the number 0. 




23.11b From a new bootstrap experiment we would obtain new and, most probably,
different values $c_u^*$ and $c_l^*$. It therefore could be, if the number 0 is close to
the edge of the first bootstrap confidence interval, that it is just outside the new
interval.


23.11c The new dataset will resemble the old one in many ways, but things like the
sample mean would most likely differ from the old one, and so there is no guarantee
that the number 0 will again be in the confidence interval.


24.6a The environmentalists are interested in a lower confidence bound, because
they would like to make a statement like "We are 97.5% confident that the
concentration exceeds 1.68 ppm [and that is much too high.]" We have normal data,
with $\sigma$ unknown so we use $s_{16} = \sqrt{1.12} = 1.058$ as an estimate and use the critical
value corresponding to 2.5% from the $t(15)$ distribution: $t_{15,0.025} = 2.131$. The
lower confidence bound is $2.24 - 2.131 \cdot 1.058/\sqrt{16} = 2.24 - 0.56 = 1.68$, the interval:
$(1.68, \infty)$.

24.6b For similar reasons, the plant management constructs an upper confidence
bound ("We are 97.5% confident pollution does not exceed 2.80 [and this is
acceptable.]"). The computation is the same except for a minus sign: $2.24 + 2.131 \cdot
1.058/\sqrt{16} = 2.24 + 0.56 = 2.80$, so the interval is $[0, 2.80)$. Note that the computed
upper and lower bounds are in fact the endpoints of the 95% two-sided confidence
interval.


24.9a From Section 8.4 we know: $P(M \le a) = [F_X(a)]^{12}$, so $P(M/\theta \le t) =
P(M \le \theta t) = [F_X(\theta t)]^{12}$. Since $X_i$ has a $U(0, \theta)$ distribution, $F_X(\theta t) = t$, for
$0 \le t \le 1$. Substituting this shows the result.

24.9b For $c_l$ we need to solve $(c_l)^{12} = \alpha/2$, or $c_l = (\alpha/2)^{1/12} = (0.05)^{1/12} = 0.7791$.
For $c_u$ we need to solve $(c_u)^{12} = 1 - \alpha/2$, or $c_u = (1 - \alpha/2)^{1/12} = (0.95)^{1/12} = 0.9958$.

24.9c From b we know that $P(c_l < M/\theta < c_u) = P(0.7790 < M/\theta < 0.9958) =
0.90$. Rewriting this equation, we get: $P(0.7790\,\theta < M < 0.9958\,\theta) = 0.90$ and
$P(M/0.9958 < \theta < M/0.7790) = 0.90$. This means that $(m/0.9958, m/0.7790) =
(3.013, 3.851)$ is a 90% confidence interval for $\theta$.


24.9d From b we derive the general formula:

\[ P\left((\alpha/2)^{1/n} < \frac{M}{\theta} < (1 - \alpha/2)^{1/n}\right) = 1 - \alpha. \]

The left hand inequality can be rewritten as $\theta < M/(\alpha/2)^{1/n}$ and the right hand
one as $M/(1 - \alpha/2)^{1/n} < \theta$. So, the statement above can be rewritten as:

\[ P\left(\frac{M}{(1 - \alpha/2)^{1/n}} < \theta < \frac{M}{(\alpha/2)^{1/n}}\right) = 1 - \alpha, \]

so that the general formula for the confidence interval becomes:

\[ \left(\frac{m}{(1 - \alpha/2)^{1/n}},\ \frac{m}{(\alpha/2)^{1/n}}\right). \]


25.4a Denote the observed numbers of cycles for the smokers by $X_1, X_2, \dots, X_{n_1}$,
and similarly $Y_1, Y_2, \dots, Y_{n_2}$ for the nonsmokers. A test statistic should compare
estimators for $p_1$ and $p_2$. Since the geometric distributions have expectations $1/p_1$
and $1/p_2$, we could compare the estimator $1/\bar X_{n_1}$ for $p_1$ with the estimator $1/\bar Y_{n_2}$ for
$p_2$, or simply compare $\bar X_{n_1}$ with $\bar Y_{n_2}$. For instance, take test statistic $T = \bar X_{n_1} - \bar Y_{n_2}$.
Values of $T$ close to zero are in favor of $H_0$, and values far away from zero are in
favor of $H_1$. Another possibility is $T = \bar X_{n_1} / \bar Y_{n_2}$.

25.4b In this case, the maximum likelihood estimators $\hat p_1$ and $\hat p_2$ give better
indications about $p_1$ and $p_2$. They can be compared in the same way as the estimators
in a.

25.4c The probability of getting pregnant during a cycle is $p_1$ for the smoking
women and $p_2$ for the nonsmokers. The alternative hypothesis should express the
belief that smoking women are less likely to get pregnant than nonsmoking women.
Therefore take $H_1 : p_1 < p_2$.


25.10a The alternative hypothesis should express the belief that the gross calorific
value exceeds 23.75 MJ/kg. Therefore take $H_1 : \mu > 23.75$.


25.10b The p-value is the probability $P(\bar X_n \ge 23.788)$ under the null hypothesis.
We can compute this probability by using that under the null hypothesis $\bar X_n$ has an
$N(23.75, (0.1)^2/23)$ distribution:

\[
P(\bar X_n \ge 23.788) = P\left(\frac{\bar X_n - 23.75}{0.1/\sqrt{23}} \ge \frac{23.788 - 23.75}{0.1/\sqrt{23}}\right) = P(Z \ge 1.82),
\]

where $Z$ has an $N(0,1)$ distribution. From Table B.1 we find $P(Z \ge 1.82) = 0.0344$.


25.11 A type I error occurs when $\mu = 0$ and $|t| > 2$. When $\mu = 0$, then $T$ has an
$N(0,1)$ distribution. Hence, by symmetry of the $N(0,1)$ distribution and Table B.1,
we find that the probability of committing a type I error is

\[ P(|T| > 2) = P(T < -2) + P(T > 2) = 2 \cdot P(T > 2) = 2 \cdot 0.0228 = 0.0456. \]
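
Both tail probabilities in 25.10b and 25.11 can be checked without a table. The sketch below uses the error function for the standard normal upper tail; `upper_tail` is our own helper. Because it does not round the test statistic to two decimals before the lookup, the results differ slightly from the table-based values in the text.

```python
from math import erf, sqrt


def upper_tail(z):
    """P(Z >= z) for a standard normal random variable Z."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))


# 25.10b: observed mean 23.788, null value 23.75, sigma = 0.1, n = 23.
z_obs = (23.788 - 23.75) / (0.1 / sqrt(23))
p_value = upper_tail(z_obs)

# 25.11: two-sided rejection rule |T| > 2 under the null hypothesis.
type_one_error = 2 * upper_tail(2.0)

print(round(z_obs, 2), round(p_value, 4), round(type_one_error, 4))
```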


26.5a The p-value is $P(X \ge 15)$ under the null hypothesis $H_0 : p = 1/2$. Using
Table 26.3 we find $P(X \ge 15) = 1 - P(X \le 14) = 1 - 0.8950 = 0.1050$.


26.5b Only values close to 23 are in favor of $H_1 : p > 1/2$, so the critical region is
of the form $K = \{c, c + 1, \dots, 23\}$. The critical value $c$ is the smallest value, such
that $P(X \ge c) \le 0.05$ under $H_0 : p = 1/2$, or equivalently, $1 - P(X \le c - 1) \le 0.05$,
which means $P(X \le c - 1) \ge 0.95$. From Table 26.3 we conclude that $c - 1 = 15$, so
that $K = \{16, 17, \dots, 23\}$.

26.5c A type I error occurs if $p = 1/2$ and $X \ge 16$. The probability that this
happens is $P(X \ge 16 \mid p = 1/2) = 1 - P(X \le 15 \mid p = 1/2) = 1 - 0.9534 = 0.0466$,
where we have used Table 26.3 once more.


26.5d In this case, a type II error occurs if $p = 0.6$ and $X \le 15$. To approximate
$P(X \le 15 \mid p = 0.6)$, we use the same reasoning as in Section 14.2, but now with
$n = 23$ and $p = 0.6$. Write $X$ as the sum of independent Bernoulli random variables:
$X = R_1 + \cdots + R_n$, and apply the central limit theorem with $\mu = p = 0.6$ and
$\sigma^2 = p(1 - p) = 0.24$. Then

\[
\begin{aligned}
P(X \le 15) &= P(R_1 + \cdots + R_n \le 15) \\
            &= P\left(\frac{R_1 + \cdots + R_n - n\mu}{\sigma\sqrt{n}} \le \frac{15 - n\mu}{\sigma\sqrt{n}}\right) \\
            &= P\left(Z_{23} \le \frac{15 - 13.8}{\sqrt{0.24}\,\sqrt{23}}\right) \approx \Phi(0.51) = 0.6950.
\end{aligned}
\]
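
The table lookups in 26.5a to 26.5c can be replaced by exact binomial sums. In the sketch below, `binomial_upper_tail` is our own helper; the search for the critical value relies on the tail probability decreasing as the threshold grows.

```python
from math import comb


def binomial_upper_tail(c, n, p):
    """P(X >= c) for X with a Bin(n, p) distribution."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))


n, p0, level = 23, 0.5, 0.05

p_value = binomial_upper_tail(15, n, p0)  # 26.5a

# 26.5b: smallest c with P(X >= c) <= level under the null hypothesis.
critical_value = next(c for c in range(n + 2) if binomial_upper_tail(c, n, p0) <= level)

type_one_error = binomial_upper_tail(critical_value, n, p0)  # 26.5c
print(round(p_value, 4), critical_value, round(type_one_error, 4))
```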




26.8a Test statistic $T = \bar X_n$ takes values in $(0, \infty)$. Recall that the $Exp(\lambda)$
distribution has expectation $1/\lambda$, and that according to the law of large numbers $\bar X_n$ will
be close to $1/\lambda$. Hence, values of $\bar X_n$ close to 1 are in favor of $H_0 : \lambda = 1$, and only
values of $\bar X_n$ close to zero are in favor of $H_1 : \lambda > 1$. Large values of $\bar X_n$ also provide
evidence against $H_0 : \lambda = 1$, but even stronger evidence against $H_1 : \lambda > 1$. We
conclude that $T = \bar X_n$ has critical region $K = (0, c_l]$. This is an example in which
the alternative hypothesis and the test statistic deviate from the null hypothesis in
opposite directions.

Test statistic $T' = e^{-\bar X_n}$ takes values in $(0, 1)$. Values of $\bar X_n$ close to zero correspond
to values of $T'$ close to 1, and large values of $\bar X_n$ correspond to values of $T'$ close
to 0. Hence, only values of $T'$ close to 1 are in favor of $H_1 : \lambda > 1$. We conclude that $T'$
has critical region $K' = [c_u, 1)$. Here the alternative hypothesis and the test statistic
deviate from the null hypothesis in the same direction.


26.8b Again, values of $\bar X_n$ close to 1 are in favor of $H_0 : \lambda = 1$. Values of $\bar X_n$ close
to zero suggest $\lambda > 1$, whereas large values of $\bar X_n$ suggest $\lambda < 1$. Hence, both small
and large values of $\bar X_n$ are in favor of $H_1 : \lambda \ne 1$. We conclude that $T = \bar X_n$ has
critical region $K = (0, c_l] \cup [c_u, \infty)$.

Small and large values of $\bar X_n$ correspond to values of $T'$ close to 1 and 0. Hence,
values of $T'$ both close to 0 and close to 1 are in favor of $H_1 : \lambda \ne 1$. We conclude that
$T'$ has critical region $K' = (0, c'_l] \cup [c'_u, 1)$. Both test statistics deviate from the null
hypothesis in the same directions as the alternative hypothesis.


26.9a Test statistic $T = (\bar{X}_n)^2$ takes values in $[0, \infty)$. Since $\mu$ is the
expectation of the $N(\mu, 1)$ distribution, according to the law of large numbers,
$\bar{X}_n$ is close to $\mu$. Hence, values of $\bar{X}_n$ close to zero are in favor of
$H_0: \mu = 0$. Large negative values of $\bar{X}_n$ suggest $\mu < 0$, and large positive
values of $\bar{X}_n$ suggest $\mu > 0$. Therefore, both large negative and large positive
values of $\bar{X}_n$ are in favor of $H_1: \mu \ne 0$. These values correspond to large
positive values of T, so T has critical region $K = [c_u, \infty)$. This is an example in
which the test statistic deviates from the null hypothesis in one direction, whereas
the alternative hypothesis deviates in two directions.

Test statistic $T'$ takes values in $(-\infty, 0) \cup (0, \infty)$. Large negative values
and large positive values of $\bar{X}_n$ correspond to values of $T'$ close to zero.
Therefore, $T'$ has critical region $K' = [c'_l, 0) \cup (0, c'_u]$. This is an example in
which the test statistic deviates from the null hypothesis for small values, whereas
the alternative hypothesis deviates for large values.


26.9b Only large positive values of $\bar{X}_n$ are in favor of $\mu > 0$, which correspond
to large values of T. Hence, T has critical region $K = [c_u, \infty)$. This is an example
where the test statistic has the same type of critical region with a one-sided or
two-sided alternative. Of course, the critical value $c_u$ in part b is different from the
one in part a.

Large positive values of $\bar{X}_n$ correspond to small positive values of $T'$. Hence, $T'$
has critical region $K' = (0, c'_u]$. This is another example where the test statistic
deviates from the null hypothesis for small values, whereas the alternative hypothesis
deviates for large values.
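The correspondence between values of the sample mean and values of the two test statistics in Exercise 26.9 can be tabulated directly. This sketch assumes that T' is the reciprocal of the sample mean, which is consistent with the value ranges stated in the solution but is not spelled out there; the sample means listed are made-up illustrative values.

```python
def t_stat(sample_mean: float) -> float:
    """T = square of the sample mean."""
    return sample_mean ** 2


def t_prime_stat(sample_mean: float) -> float:
    """T' = reciprocal of the sample mean (assumed definition, undefined at zero)."""
    return 1.0 / sample_mean


# Hypothetical sample means: two far from zero, two close to zero.
sample_means = [-3.0, -0.2, 0.2, 3.0]
rows = [(m, t_stat(m), t_prime_stat(m)) for m in sample_means]
for m, t, t_prime in rows:
    print(f"sample mean {m:5.1f}   T = {t:5.2f}   T' = {t_prime:6.2f}")
```

Sample means far from zero in either direction give large T but T' close to zero, which is why the two-sided alternative leads to a one-sided region for T and to a region around zero for T'.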


27.5a The interest is whether the inbreeding coefficient exceeds 0. Let $\mu$ represent
this coefficient for the species of wasps. The value 0 is the a priori specified value
of the parameter, so test null hypothesis $H_0: \mu = 0$. The alternative hypothesis
should express the belief that the inbreeding coefficient exceeds 0. Hence, we take
alternative hypothesis $H_1: \mu > 0$. The value of the test statistic is

\[
t = \frac{0.044}{0.884/\sqrt{197}} = 0.70.
\]

27.5b Because n = 197 is large, we approximate the distribution of T under the
null hypothesis by an N(0,1) distribution. The value t = 0.70 lies to the right of
zero, so the p-value is the right tail probability P(T > 0.70). By means of the normal
approximation we find from Table B.1 that the right tail probability

\[
P(T > 0.70) \approx 1 - \Phi(0.70) = 0.2420.
\]

This means that the value of the test statistic is not very far in the (right) tail of
the distribution and is therefore not to be considered exceptionally large. We do not
reject the null hypothesis.
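The arithmetic in Exercise 27.5 can be reproduced as follows. This is a minimal sketch: the variable names are illustrative, and the standard normal distribution function is computed from the error function rather than read from Table B.1, so the p-value differs slightly from the tabulated 0.2420 because t is not rounded to 0.70 first.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Distribution function of the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


sample_mean = 0.044  # numerator of the test statistic in the solution
sample_sd = 0.884    # sample standard deviation appearing in the denominator
n = 197              # sample size

t = sample_mean / (sample_sd / math.sqrt(n))
p_value = 1.0 - std_normal_cdf(t)  # right tail probability under the N(0,1) approximation

print(f"t = {t:.2f}")
print(f"p-value is approximately {p_value:.4f}")
```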


27.7a The data are modeled by a simple linear regression model: $Y_i = \alpha + \beta x_i$,
where $Y_i$ is the gas consumption and $x_i$ is the average outside temperature in the ith
week. Higher gas consumption as a consequence of smaller temperatures corresponds
to $\beta < 0$. It is natural to consider the value 0 as the a priori specified value of the
parameter (it corresponds to no change of gas consumption). Therefore, we take null
hypothesis $H_0: \beta = 0$. The alternative hypothesis should express the belief that the
gas consumption increases as a consequence of smaller temperatures. Hence, we take
alternative hypothesis $H_1: \beta < 0$. The value of the test statistic is

\[
t_b = \frac{\hat{\beta}}{s_b} = \frac{-0.3932}{0.0196} = -20.06.
\]

The test statistic $T_b$ has a t-distribution with $n - 2 = 24$ degrees of freedom. The
value −20.06 is smaller than the left critical value $-t_{24,0.05} = -1.711$, so we reject.

27.7b For the data after insulation, the value of the test statistic is

\[
t_b = \frac{-0.2779}{0.0252} = -11.03,
\]

and $T_b$ has a $t(28)$ distribution. The value −11.03 is smaller than the left critical
value $-t_{28,0.05} = -1.701$, so we reject.
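Both test statistics in Exercise 27.7 follow the same recipe: divide the estimated slope by its standard error and compare with the left critical value. The sketch below uses hypothetical function and variable names and takes the critical values as quoted in the solution instead of computing them from the t-distribution.

```python
def slope_t_statistic(beta_hat: float, s_b: float) -> float:
    """Test statistic for H0: beta = 0, the estimated slope divided by its standard error."""
    return beta_hat / s_b


def reject_left_sided(t_value: float, left_critical_value: float) -> bool:
    """Reject H0 in favor of H1: beta < 0 when t falls at or below the left critical value."""
    return t_value <= left_critical_value


t_before = slope_t_statistic(-0.3932, 0.0196)  # part a
t_after = slope_t_statistic(-0.2779, 0.0252)   # part b, data after insulation

# Left critical values at level 0.05, as quoted in the text (24 and 28 degrees of freedom).
reject_before = reject_left_sided(t_before, -1.711)
reject_after = reject_left_sided(t_after, -1.701)

print(f"part a: t = {t_before:.2f}, reject H0: {reject_before}")
print(f"part b: t = {t_after:.2f}, reject H0: {reject_after}")
```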


28.5a When $aS_X^2 + bS_Y^2$ is unbiased for $\sigma^2$, we should have
$E[aS_X^2 + bS_Y^2] = \sigma^2$. Using that $S_X^2$ and $S_Y^2$ are both unbiased for
$\sigma^2$, i.e., $E[S_X^2] = \sigma^2$ and $E[S_Y^2] = \sigma^2$, we get

\[
E[aS_X^2 + bS_Y^2] = aE[S_X^2] + bE[S_Y^2] = (a + b)\sigma^2.
\]

Hence, $E[aS_X^2 + bS_Y^2] = \sigma^2$ for all $\sigma > 0$ if and only if $a + b = 1$.

28.5b By independence of $S_X^2$ and $S_Y^2$ write

\[
\mathrm{Var}(aS_X^2 + (1-a)S_Y^2) = a^2\,\mathrm{Var}(S_X^2) + (1-a)^2\,\mathrm{Var}(S_Y^2)
= \left( \frac{a^2}{n-1} + \frac{(1-a)^2}{m-1} \right) 2\sigma^4.
\]

To find the value of a that minimizes this, differentiate with respect to a and put
the derivative equal to zero. This leads to

\[
\frac{2a}{n-1} - \frac{2(1-a)}{m-1} = 0.
\]

Solving for a yields $a = (n-1)/(n+m-2)$. Note that the second derivative of
$\mathrm{Var}(aS_X^2 + (1-a)S_Y^2)$ is positive so that this is indeed a minimum.
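The minimizing weight in Exercise 28.5b can be verified with exact rational arithmetic. This is a sketch: the sample sizes n = 10 and m = 25 are arbitrary illustrative values, and the function names are hypothetical. The factor 2σ⁴ is dropped because it does not affect the location of the minimum.

```python
from fractions import Fraction


def variance_factor(a: Fraction, n: int, m: int) -> Fraction:
    """Var(a*S_X^2 + (1-a)*S_Y^2) divided by 2*sigma^4, from the expression in the solution."""
    return a**2 / Fraction(n - 1) + (1 - a) ** 2 / Fraction(m - 1)


def optimal_weight(n: int, m: int) -> Fraction:
    """Closed-form minimizer a = (n-1)/(n+m-2)."""
    return Fraction(n - 1, n + m - 2)


n, m = 10, 25  # illustrative sample sizes
a_star = optimal_weight(n, m)

# Brute-force check on a grid of weights between 0 and 1.
grid = [Fraction(k, 1000) for k in range(1001)]
best_on_grid = min(grid, key=lambda a: variance_factor(a, n, m))

print(f"closed form a = {a_star} (about {float(a_star):.4f})")
print(f"best grid value a = {float(best_on_grid):.3f}")
print(f"minimal variance factor = {variance_factor(a_star, n, m)}")
```

At the optimum the variance factor equals 1/(n+m−2), so the minimal variance is 2σ⁴/(n+m−2).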




List of symbols 


∅             empty set, page 14
α             significance level, page 384
A^c           complement of the event A, page 14
A ∩ B         intersection of A and B, page 14
A ⊂ B         A subset of B, page 15
A ∪ B         union of A and B, page 14
Ber(p)        Bernoulli distribution with parameter p, page 45
Bin(n, p)     binomial distribution with parameters n and p, page 48
c_l, c_u      left and right critical values, page 388
Cau(α, β)     Cauchy distribution with parameters α and β, page 161
Cov(X, Y)     covariance between X and Y, page 139
E[X]          expectation of the random variable X, page 90, 91
Exp(λ)        exponential distribution with parameter λ, page 62
Φ             distribution function of the standard normal distribution, page 65
φ             probability density of the standard normal distribution, page 65
f             probability density function, page 57
f             joint probability density function, page 119
F             distribution function, page 44
F             joint distribution function, page 118
F^inv         inverse function of distribution function F, page 73
F_n           empirical distribution function, page 219
f_{n,h}       kernel density estimate, page 213
Gam(α, λ)     gamma distribution with parameters α and λ, page 157
Geo(p)        geometric distribution with parameter p, page 49
H_0, H_1      null hypothesis and alternative hypothesis, page 374
L(θ)          likelihood function, page 317
ℓ(θ)          loglikelihood function, page 319
Med_n         sample median of a dataset, page 231
n!            n factorial, page 14
N(μ, σ²)      normal distribution with parameters μ and σ², page 64
Ω             sample space, page 13
Par(α)        Pareto distribution with parameter α, page 63
Pois(μ)       Poisson distribution with parameter μ, page 170
P(A | C)      conditional probability of A given C, page 26
P(A)          probability of the event A, page 16
q_n(p)        pth empirical quantile, page 234
q_p           pth quantile or 100pth percentile, page 66
ρ(X, Y)       correlation coefficient between X and Y, page 142
s_n²          sample variance of a dataset, page 233
S_n²          sample variance of random sample, page 292
t(m)          t-distribution with m degrees of freedom, page 348
t_{m,p}       critical value of the t(m) distribution, page 348
U(α, β)       uniform distribution with parameters α and β, page 60
Var(X)        variance of the random variable X, page 96
x̄_n           sample mean of a dataset, page 231
X̄_n           average of the random variables X_1, ..., X_n, page 182
z_p           critical value of the N(0,1) distribution, page 345


Index 


addition rule 

continuous random variables 156 

discrete random variables 152 
additivity of a probability function 16
Agresti-Coull method 364 
alternative hypothesis 374 
asymptotic minimum variance 322 
asymptotically unbiased 322 
average see also sample mean 

expectation and variance of 182 


ball bearing example 399 
data 399 
one-sample t-test 401 
two-sample test 421 

bandwidth 213 
data-based choice of 216 

Bayes’ rule 32 

Bernoulli distribution 45 
expectation of 100 
summary of 429 
variance of 100 

bias 290 

Billingsley, P. 199 

bimodal density 183 

bin 210 

bin width 211 
data-based choice of 212 

binomial distribution 48 
expectation of 138 
summary of 429 
variance of 141 

birthdays example 27 

bivariate dataset 207, 221 


scatterplot of 221 
black cherry trees example 267 
t-test for intercept 409 
data 266 
scatterplot 267 
bootstrap 
confidence interval 352 
dataset 273 
empirical see empirical bootstrap 
parametric see parametric bootstrap
principle 270
for X̄_n 270
for X̄_n − μ 271
for Med_n − F^inv(0.5) 271
for T_ks 278
random sample 270 
sample statistic 270 
Bovine Spongiform Encephalopathy 30
boxplot 236 
constructed for 
drilling data 238 
exponential data 261 
normal data 261 
Old Faithful data 237 
software data 237 
Wick temperatures 240 
outlier in 236 
whisker of 236 
BSE example 30 
buildings example 94 
locations 174 




Cauchy distribution 92, 110, 114, 161
summary of 429
center of a dataset 231
center of gravity 90, 91, 101
central limit theorem 197 
applications of 199 
for averages 197 
for sums 199 
Challenger example 5 
data 226, 240 
change of units 105 
correlation under 142 
covariance under 141 
expectation under 98
variance under 98 
change-of-variable formula 96 
two-dimensional 136 
Chebyshev’s inequality 183 
chemical reactor example 26, 61, 65
cloud seeding example 419 
data 420 
two-sample test 422 
coal example 347 
data 347, 350 
coin tossing 16 
until a head appears 20 
coincident birthdays 27 
complement of an event 14 
concave function 112 
conditional probability 25, 26 
confidence bound 
lower 367 
upper 367 
confidence interval 3, 343
bootstrap 352
conservative 343
equal-tailed 347
for the mean 345
large sample 353
one-sided 366, 367
relation with testing 392
confidence level 343
confidence statements 342
conservative confidence interval 343
continuous random variable 57
convex function 107
correlated
negatively 139
positively 139
versus independent 140
correlation coefficient 142 

dimensionlessness of 142 

under change of units 142 
covariance 139 

alternative expression of 139 

under change of units 141 
coverage probabilities 354 
Cramér-Rao inequality 305 
critical region 386 
critical values 

in testing 386 

of t-distribution 348 

of N(0,1) distribution 433 

of standard normal distribution 345 
cumulative distribution function 44 


darts example 59, 60, 69
dataset
bivariate 221
center of 231
five-number summary of 236
outlier in 232
univariate 210
degrees of freedom 348
DeMorgan’s laws 15
density see probability density function
dependent events 33
discrete random variable 42
discrete uniform distribution 54
disjoint events 15, 31, 32
distribution
t-distribution 348
Bernoulli 45
binomial 48
Cauchy 114, 161
discrete uniform 54
Erlang 157
exponential 62
gamma 157
geometric 49
hypergeometric 54
normal 64
Pareto 63
Poisson 170
uniform 60
Weibull 86
distribution function 44


joint 
bivariate 118 
multivariate 122 
marginal 118 
properties of 45 
drill bits 89 
drilling example 221, 415
boxplot 238 
data 222 
scatterplot 223 
two-sample test 418 
durability of tires 356 


efficiency 

arbitrary estimators 305 
relative 304 

unbiased estimators 303 
efficient 303 

empirical bootstrap 272 

simulation
for centered sample mean 274, 275
for nonpooled studentized mean difference 421
for pooled studentized mean difference 418
for studentized mean 351, 403
empirical distribution function 219 


computed for 
exponential data 260 
normal data 260 
Old Faithful data 219 
software data 219 


law of large numbers for 249 


relation with histogram 220 
empirical percentile 234 
empirical quantile 234, 235 


law of large numbers for 252 


of Old Faithful data 235 
envelopes on doormat 14 
Erlang distribution 157 
estimate 286 

nonparametric 255 
estimator 287 

biased 290 

unbiased 290 
Euro coin example 369, 388 
events 14 

complement of 14 

dependent 33 




disjoint 15 
independent 33 
intersection of 14 
mutually exclusive 15 
union of 14 

Example 
alpha particles 354 
ball bearings 399 
birthdays 27 
black cherry trees 409 
BSE 30 
buildings 94 
Challenger 5, 226, 240 
chemical reactor 26 
cloud seeding 419 
coal 347 
darts 59 
drilling 221, 415
Euro coin 369, 388 
freeway 383 
iris recognition 1 
Janka hardness 223 
jury 75 
killer football 3 
Monty Hall quiz 4, 39
mortality rate 405 
network server 285, 306 
Old Faithful 207, 404 
Rutherford and Geiger 354 
Shoshoni Indians 402 
software reliability 218 
solo race 151 
speed of light 9, 246 
tank 7, 299, 373 
Wick temperatures 231 

expectation
linearity of 137
of a continuous random variable 91
of a discrete random variable 90
expected value see expectation
explanatory variable 257
exponential distribution 62
expectation of 93, 100
memoryless property of 62
shifted 364
summary of 429
variance of 100


factorial 14


false negative 30 

false positive 30 

Feller, W. 199 

1500 m speedskating 357 

Fisher, R.A. 316 

five-number summary 236 
of Old Faithful data 236 
of Wick temperatures 240 

football teams 23 

freeway example 383 


gamma distribution 157, 172
summary of 429
Gaussian distribution see normal distribution
Geiger counter 167
geometric distribution 49
expectation of 93, 153
memoryless property of 50
summary of 429
geometric series 20
golden rectangle 402
gross calorific value 347


heart attack 3 
heteroscedasticity 334 
histogram 190, 211 
bin of 210 
computed for 
exponential data 260 
normal data 260 
Old Faithful data 210, 211 
software data 218 
constructed for 
deviations T and M 78
juror 1 scores 78
height of 211
law of large numbers for 250
reference point of 211
relation with F_n 220
homogeneity 168 
homoscedasticity 334 
hypergeometric distribution 54 


independence 
of events 33 
three or more 34 
of random variables 124 
continuous 125 


discrete 125 
propagation of 126 

pairwise 35 

physical 34 

statistical 34 

stochastic 34 

versus uncorrelated 140 
independent identically distributed sequence 182

indicator random variable 188 
interarrival times 171 
intercept 257 
interquartile range see IQR
intersection of events 14 
interval estimate 342 
invariance principle 321 
IQR 236 

in boxplot 236 

of Old Faithful data 236 

of Wick temperatures 240
iris recognition example 1 
isotropy of Poisson process 175 


Janka hardness example 223 
data 224 
estimated regression line 258 
regression model 256 


scatterplot 223, 257, 258 
Jensen’s inequality 107 
joint
continuous distribution 118, 123
bivariate 119
discrete distribution 115
of sum and maximum 116
distribution function
bivariate 118
multivariate 122
relation with marginal 118
probability density
bivariate 119
multivariate 123
relation with marginal 122
probability mass function
bivariate 116
drawing without replacement 123
multivariate 122
of sum and maximum 116
jury example 75


kernel 213 
choice of 217 
Epanechnikov 213
normal 213 
triweight 213 
kernel density estimate 215 
bandwidth of 213, 215 
computed for 
exponential data 260 
normal data 260 
Old Faithful data 213, 216, 217 
software data 218 
construction of 215 
example 
software data 255 
with boundary kernel 219 
of software data 218, 255 
killer football example 3 
Kolmogorov-Smirnov distance 277 


large sample confidence interval 353 
law of large numbers 185 
for F_n 249
for empirical quantile 252 
for relative frequency 253 
for sample standard deviation 253 
for sample variance 253 
for the histogram 250 
for the MAD 253 
for the sample mean 249 
strong 187 
law of total probability 31 
leap years 17 
least squares estimates 330 
left critical value 388 
leverage point 337 
likelihood function 
continuous case 317 
discrete case 317 
linearity of expectations 137 
loading a bridge 13 
logistic model 7 
loglikelihood function 319 
lower confidence bound 367 


MAD 234 
law of large numbers for 253 
of a distribution 267 
of Wick temperatures 234 




mad cow disease 30 
marginal 
distribution 117 
distribution function 118 
probability density 122 
probability mass function 117 
maximum likelihood estimator 317 
maximum of random variables 109 
mean see expectation 
mean integrated squared error 212, 216
mean squared error 305 
measuring angles 308 
median 66 
of a distribution 267 
of dataset see sample median 
median of absolute deviations see MAD
memoryless property 50, 62 
method of least squares 329 
Michelson, A.A. 181 
minimum variance unbiased estimator 305
minimum of random variables 109 
mode 
of dataset 211 
of density 183 


model 
distribution 247 
parameters 247, 285 


validation 76 

Monty Hall quiz example 4, 39
sample space 23 

mortality rate example 405 
data 406 

MSE 305 

“μ ± a few σ” rule 185

multiplication rule 27 

mutually exclusive events 15 


network server example 285, 306 
nonparametric estimate 255 
nonpooled variance 420 
normal distribution 64 
under change of units 106 
bivariate 159 
expectation of 94 
standard 65 
summary of 429 




variance of 97 
null hypothesis 374 


O-rings 5 
observed significance level 387 
Old Faithful example 207 
boxplot 237 
data 207 
empirical bootstrap 275 
empirical distribution function 219, 254
empirical quantiles 235
estimates for f and F 254
five-number summary 236
histogram 210, 211
IQR 236
kernel density estimate 213, 216, 217, 254
order statistics 209
quartiles 236
sample mean 208
scatterplot 229
statistical model 254
t-test 404
order statistics 235
of Old Faithful data 209
of Wick temperatures 235
outlier 232
in boxplot 236


p-value 376 
as observed significance level 379, 387
one-tailed 390 
relation with critical value 387 
two-tailed 390 
pairwise independent 35 
parameter of interest 286 
parametric bootstrap 276 
for centered sample mean 276 
for KS distance 277 
simulation 
for centered sample mean 277 
for KS distance 278 
Pareto distribution 63, 86, 92 
expectation of 100 
summary of 429 
variance of 100 
percentile 66 


of dataset see empirical percentile 
permutation 14 
physical independence 34 
point estimate 341 
Poisson distribution 170 
expectation of 171 
summary of 429 
variance of 171 
Poisson process 
k-dimensional 174 
higher-dimensional 174 
isotropy of 175 
locations of points 173 
one-dimensional 172 
points of 172 
simulation of 175 
pooled variance 417 
probability 16 
conditional 25, 26 
of a union 18
of complement 18
probability density function 57
of product XY 160
of quotient X/Y 161
of sum X + Y 156
probability distribution 43, 59 
probability function 16 
on an infinite sample space 20 
additivity of 16 
probability mass function 43 
joint 
bivariate 116 
multivariate 122 
marginal 117 
of sum X + Y 152
products of sample spaces 18 


quantile 

of a distribution 66 

of dataset see empirical quantile 
quartile 

lower 236 

of Old Faithful data 236 

upper 236 


random sample 246 
random variable 
continuous 57 
discrete 42 


realization 
of random sample 247 
of random variable 72 
regression line 257, 329
estimated
for Janka hardness data 258, 330
intercept of 257, 331
slope of 257, 331
regression model
general 256
linear 257, 329
relative efficiency 304
relative frequency
law of large numbers for 253
residence times 26
residual 332
response variable 257
right continuity of F 45
right critical value 388
right tail probabilities 377
of the N(0,1) distribution 65, 345, 433
Ross, S.M. 199
run, in simulation 77


sample mean 231 

law of large numbers for 249 

of Old Faithful data 208 

of Wick temperatures 231 
sample median 232 

of Wick temperatures 232 
sample space 13 

bridge loading 13 

coin tossing 13 

twice 18 

countably infinite 19 

envelopes 14 

months 13 

products of 18 

uncountable 17 
sample standard deviation 233 

law of large numbers for 253 

of Wick temperatures 233 
sample statistic 249 

and distribution feature 254 
sample variance 233 

law of large numbers for 253 
sampling distribution 289 
scatterplot 221 


of black cherry trees 267
of drill times 223
of Janka hardness data 223, 257, 258
of Old Faithful data 229
of Wick temperatures 232
second moment 98
serial number analysis 7, 299
shifted exponential distribution 364
Shoshoni Indians example 402
data 403
significance level 384
observed 387
of a test 384


simple linear regression 257, 329 
simulation 
of the Poisson process 175 
run 77 


slope of regression line 257 
software reliability example 218 
boxplot 237 
data 218 
empirical distribution function 219, 256
estimated exponential 256
histogram 255
kernel density estimate 218, 255, 256
order statistics 227
sample mean 255
solo race example 151
space shuttle Challenger 5
speed of light example 9, 181
data 246
sample mean 256
speeding 104
standard deviation 97
standardizing averages 197
stationarity 168
weak 168
statistical independence 34
statistical model
random sample model 247
simple linear regression model 257, 329
stochastic independence 34
stochastic simulation 71
strictly convex function 107
strong law of large numbers 187


studentized mean 349, 401 
studentized mean difference 
nonpooled 421 
pooled 417 
sum of squares 329 
sum of two random variables 
binomial 153 
continuous 154 
discrete 151 
exponential 156 
geometric 152 
normal 158 
summary of distributions 429 


t-distribution 348 
t-test 399
one sample 
large sample 404 
nonnormal data 402 
normal data 401 
test statistic 400 
regression 
intercept 408 
slope 407 
two samples 
large samples 422 
nonnormal with equal variances 
418 
normal with equal variances 417 
with unequal variances 419 
tail probability 
left 377 
right 345, 377 
tank example 7, 299, 373 
telephone 
calls 168 
exchange 168 
test statistic 375 
testing hypotheses 
alternative hypothesis 373 
critical region 386 
critical values 386 
null hypothesis 373 
p-value 376, 386, 390 
relation with confidence intervals 
392 
significance level 384 
test statistic 375 
type I error 377, 378
type II error 378, 390
tires 8
total probability, law of 31
traffic flow 177
true distribution 247
true parameter 247
type I error 378
probability of committing 384
type II error 378
probability of committing 391


UEFA playoffs draw 23 
unbiased estimator 290 
uniform distribution 60
expectation of 92, 100
summary of 429
variance of 100
union of events 14 
univariate dataset 207, 210
upper confidence bound 367 


validation of model 76 
variance 96 
alternative expression 97 
nonpooled 420 
of average 182 
of the sum 
of n random variables 149 
of two random variables 140 
pooled 417 


Weibull distribution 86, 112 
as model for ball-bearings 265 
whisker 236 
Wick temperatures example 231 
boxplot 240 
corrected data 233 
data 231 
five-number summary 240 
MAD 234 
order statistics 235 
sample mean 231 
sample median 232 
sample standard deviation 233 
scatterplot 232 
Wilson method 361 
work in system 83 
worngly spelled words 176 


